To My Father and Mother


Editor’s Introduction 


Although measurement is as old as man, its refinement and usefulness 
for educational purposes is much more modern. Today’s schools find a wide 
range of needs for measurement. In facilitating learning, in improving in- 
struction, in effective counseling and in educational placement measure- 
ment plays an important role. Pupil, parent, teacher, supervisor, principal, 
superintendent and board of education member all find measurement indis- 
pensable to the total educational program. Consciously or unconsciously 
measurement is employed by the general public in its evaluation of our 
schools. 

If the purpose of education is to achieve desirable change, then it is 
necessary to know with some degree of exactness the relationships among 
various educational procedures, the aptitudes of learners, and the changes 
in human behavior that result. Accurate prediction and control in the 
educational process are prime considerations of educational measurement. 
What changes in behavior are desirable? How may these changes be 
measured? What aptitudes are essential to the development of an accepted 
form and level of behavior and the crucial elements in the educative process? 
The contribution and value of educational measurement are inextricably 
interwoven with the validity of the answers to these questions. 

In his Introduction to Educational Measurement Professor Victor H. Noll 
has aimed at writing a text that is both scientific and practical, one that is 
useful and understandable both to the experienced educator and to the 
beginner. The task has been a challenging one, the outcome gratifying. 
In the chapters which make up this illuminating and interesting presenta- 
tion, measurement takes on a reality and vitality to which it has long been 
entitled. The author’s presentation of the applications of measurement 
and its contribution to the educational program will interest and serve the 
entire corps of school personnel and the layman eager to know more about 
this important educational tool. 

The author is well qualified for his undertaking. For more than
twenty-five years he has been teaching courses and carrying on research in
measurement. He has assisted many teachers and school systems in planning
and executing measurement programs. In his Introduction to Educational
Measurement he has produced a readable and authoritative text which
should establish itself as a leader in the field.


Herold C. Hunt


Preface 


This book is the outgrowth of a strong conviction of the importance of 
measurement in the educational process and of sincere respect for what it 
has already contributed to the improvement of education. It is written in 
the belief that modern education could not and would not have progressed 
to its present status without the help of measurement and that measure- 
ment is essential to future progress in education. 

A second factor influencing the writer of this volume is the conviction that 
many teachers lack adequate preparation and competence in measurement. 
A recent survey of measurement courses in eighty collegiate institutions of 
various types clearly suggests that teachers are not receiving adequate 
training in measurement. Among these eighty institutions, all of which 
offer work in education, it was found that only sixty-six offer an intro- 
ductory course in educational measurement, only fourteen of the sixty-six 
require the course, either of graduates or undergraduates, and only nine 
require such a course of all undergraduates preparing to teach. Similarly, 
a recent analysis of requirements by the states for teachers’ licenses or cer- 
tificates shows that only five states require a course in measurement for a 
teacher's license, five states recommend such a course, a few recommend or 
require it for administrators’ or counselors’ credentials, and only one state 
lists such a course as a requirement for all teachers’ certificates. 

Third, and perhaps most significant to the writer, are certain impressions 
firmly established as a result of more than a quarter century of experience in 
teacher education. Briefly, they are (a) that most teachers and counselors 
are genuinely interested in learning about measurement once they have 
made a beginning and once some misconceptions and prejudices have been 
cleared away, (b) that it is possible to give at least a foundation in measure- 
ment in a one-quarter or one-semester course, and (c) that it is possible in 
such a course to develop an objective, realistic viewpoint concerning the 
place, nature, and purposes of measurement in modern education. 

Implicit in all this is a crucial challenge to instructors of educational
measurement. If interest and growth in this field are to continue and
increase, the introductory measurement course must function; that is, it must
prepare teachers to perform skillfully the kinds of measurement and
evaluation tasks that effective teaching requires. Introduction to Educational
Measurement has been written in recognition of this challenge. It is the 
writer’s hope that this book will make its contribution to the growth of edu- 
cational measurement by stimulating the interest of students and teachers 
with a simple, functional approach. The book attempts to provide an 
orientation to the field of measurement and evaluation in education, a 
foundation in measurement and statistical theory, and a wide acquaintance
with appropriate instruments, procedures, and sources of information. 
Another primary aim of Introduction to Educational Measurement is to de- 
velop in students some understanding and skill in constructing measuring 
instruments for their own purposes. 

This book is intended for teachers at both the elementary and secondary 
levels. The basic theory and principles are much the same, regardless of the 
level at which they are applied. An attempt has been made to provide exer- 
cises and illustrations sufficiently varied to meet the needs of teachers and 
others at all school levels, and there are separate chapters dealing with the 
measurement of achievement in elementary and secondary schools. The
book should be useful not only to teachers and prospective teachers, but also 
to school psychologists, counselors, and administrators, and those engaged 
in, or contemplating, educational research activities. 

The content of Introduction to Educational Measurement follows a plan
of organization closely related to the purposes set forth above. Chapters 1 
and 2 are essentially introductory, presenting a point of view and some his- 
torical background. Chapters 3 and 4 include the fundamental statistics 
and measurement theory which the average classroom teacher needs and 
can absorb with little or no background or experience. Chapters 5, 6, and 7 
present the fundamentals of test construction with special attention to the 
improvement of the teacher's own classroom measurements. In Chapters 
8 and 9 achievement tests at the elementary and secondary levels in the 
main branches of instruction are surveyed. No attempt is made to include 
all tests or tests in every subject; rather, representative tests in the major 
fields are closely examined, and others well-known in those fields are listed. 
This approach will enable the prospective teacher to gain a familiarity with 
the various types of instruments available, and should prepare him to make 
intelligent choices among such instruments and to interpret the results 
obtained by their use. 

Chapters 10, 11, and 12 deal with the measurement of capacity and per- 
sonality. While the major emphasis throughout the book is on measuring 
achievement, a knowledge and understanding of intelligence and aptitude 
tests, personality and interest inventories, rating scales, and other such 
instruments is considered a necessary part of the teacher’s, and particularly 
of the counselor’s, orientation in this field. Chapter 13 deals with the
organization and administration of measurement programs suited to the needs
and resources of school systems of various sizes. Chapter 14 brings to- 
gether, summarizes, and enlarges upon the uses of measurement referred to 
in previous chapters, and suggests a number of additional applications. 

A word about Chapter 3 and Appendix A will be appropriate here. Sta- 
tistical methods are the bugbear of most students in measurement courses. 
Many have a well-established emotional block when the words “statistics” 
or “mathematics” are mentioned. Chapter 3 is designed to help overcome 
this block by an attempt to present, simply but soundly, the most elemen- 
tary processes for interpreting test scores. The chapter is not written for 
statisticians, but for beginners in educational measurement. It deals with 
ungrouped data, and involves only the fundamentals of arithmetic and 
square root. The chapter should serve to give students a real
measure of confidence in handling test scores. For those who wish to go
beyond the statistics presented in Chapter 3, Appendix A is provided.
There, the ideas of Chapter 3 are expanded and brought to a level which 
will meet the needs of counselors, guidance workers, school psychologists, 
those in charge of testing programs, and teachers in service who desire a 
more thorough knowledge. The instructor in the course will decide whether 
Chapter 3, Appendix A, or both, should be used. 

So many persons have played an important part in the preparation of this 
book that it would be futile to try to mention them all. The writer is 
indebted to the many students who have been members of his classes in edu- 
cational measurement over the years, and whose interest, application, and 
questions have served to sharpen and clarify his own thinking. A number 
of these have been graduate students and colleagues working for advanced 
degrees. Their ideas and suggestions have been very helpful. 

The writer is indebted also to the many publishers of tests and other ma- 
terials in this field for permission to reproduce samples and to quote from 
manuals and books, and to the public school systems that permitted
description of their record systems or testing programs.

Sincere gratitude is due Dr. Robert L. Ebel, Vice President, Educational 
Testing Service, and former director of the Bureau of Educational Research
and Service of the State University of Iowa, for a most careful and helpful 
reading of the entire manuscript. Dr. Ebel's comments and suggestions 
were consistently constructive, and they have added substantially to what- 
ever merit the book may possess. It should be added that any faults or 
shortcomings in this book are the responsibility of the writer. 

Acknowledgment of faithful and efficient service is due two expert typists, 
Mrs. Maureen Choate and Mrs. Dorothy Otto. I wish also to acknowledge 
with deepest appreciation the substantial contributions of my wife, Rachel 
Perkins Noll. She typed most of the first draft, read the entire manuscript
and some of the proof, and assisted in innumerable other ways. Her
constant encouragement and willingness to help at all times have been a real
inspiration.
Educational Measurement: An Overview


WHY MEASUREMENT IN EDUCATION? 


Measurement goes on constantly everywhere, and it seems reasonable 
to suppose that the process has been used for a long time. In making his 
first suits of clothes, primitive man undoubtedly performed crude meas- 
urements when he selected and cut a large animal skin for himself and 
smaller ones for his mate and children. When he built a shelter, he probably 
used his own height as a measure, or if he sought a cave, he looked for one 
large enough for him and his family. As man came into contact with other 
men and began to exchange and share, it is likely that he devised crude 
standards of measure for purposes of barter. 

Earliest records indicate that by the dawn of recorded history man had 
developed some systems of measurement. The ancient Egyptians must 
have had fairly accurate methods of measurement to build the pyramids. 
The Book of Genesis states that Noah built his ark three hundred cubits 
long, fifty cubits in breadth, and thirty cubits high.¹ There is also reference
in the same source to the weighing of gold and silver. 

The ancient Greeks and Romans had well-developed systems of measure- 
ment and performed very exact work in constructing roads, bridges, build- 
ings, arches, and monuments. Before modern times, units were developed 
and adopted which have come down to this day, in name at least. The 
foot was based on the length of the human foot, the pennyweight (weight
of a penny) was equal to 32 wheatcorns in the midst of the ear, and the
ounce was 20 pennyweights.²
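
The arithmetic behind these early units can be sketched briefly. The two ratios below are the ones stated in the text; the function name and the idea of expressing an ounce in wheatcorns are illustrative assumptions, not part of the historical record.

```python
# Ratios taken from the text: 32 wheatcorns to the pennyweight,
# 20 pennyweights to the ounce.
WHEATCORNS_PER_PENNYWEIGHT = 32
PENNYWEIGHTS_PER_OUNCE = 20


def ounces_to_wheatcorns(ounces):
    """Convert ounces to wheatcorns by chaining the two ratios (hypothetical helper)."""
    return ounces * PENNYWEIGHTS_PER_OUNCE * WHEATCORNS_PER_PENNYWEIGHT


print(ounces_to_wheatcorns(1))  # 640
```

A chain of this kind is only as stable as its base unit: if the wheatcorn varies from ear to ear, every larger unit built on it varies as well.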

Some early attempts at standardization were necessary since, for example, 
the foot unit was not the length of just any man's foot, but of one particular 


¹ A cubit was the distance from a man's elbow to the tip of his outstretched middle
finger, a distance of about 18 inches.

² William Hallock and Herbert T. Wade, The Evolution of Weights and Measures and
the Metric System (New York: The Macmillan Company, 1906), Chap. 1.
foot, that of the king. Naturally, if the king died or was deposed and a
new one came to the throne, the length of the foot might change also. The 
sheer inconvenience of having variable units and, of course, the development 
of scientific methods made constant units and exact measurement a neces- 
sity. Since natural units tended to vary, it became the usual practice to 
arbitrarily fix unit measures by law or royal decree. Ultimately this process 
reached its highest level in the development and general acceptance in most 
European countries and in scientific work everywhere of the metric system 
of weights and measures. In this system natural but stable measures formed 
the basis for units which were adopted and made permanent standards by 
international agreement.

All of us, every day, practice measurement in countless ways. Nearly 
everything — gasoline, potatoes, lumber, yard goods, etc. — is sold and 
purchased by measured amounts. When we travel, distances are expressed 
in units of measure; growth of plants or animals can be measured accurately;
the passage of time is measured in hours, days, years, or centuries; when 
we make estimates of time, distance, weight, size, color, texture or countless 
other traits, qualities, or amounts, we are making measurements. 

The remarkable scientific discoveries and advances of the past few
hundred years have brought about and depend squarely upon accurate
measurement. The work of the chemist, the physicist, the biologist, and 
the astronomer could not be done without accurate measurement. Volumes, 
weights, temperatures, pressures, speed, time — all are measured with 
great accuracy in scientific work. The astronomer calculates and predicts
the exact time to the second, years in advance, of the solar eclipse; the 
chemist weighs quantities so small that they can be seen only under magni- 
fication; he determines the composition of substances with methods of anal- 
ysis that are amazingly precise; every scientific worker employs measuring 
techniques and instruments that are, on the whole, sufficiently accurate 
and dependable for the purposes at hand. And some of these are sur- 
prisingly close. For example, as complicated a mechanism as an automobile 
is manufactured in thousands of parts, each made to such exact specifica- 
tions that millions of cars are assembled without any major adjustment 
being necessary before they are driven off the end of the assembly line. 

As measurement became more accurate and more commonly used, it was
inevitable that refinements in technique would affect educational
measurement also. To be sure, teachers have had the responsibility of judging and
appraising ever since teaching began. They could not fail to note that 
some individuals learned more easily and rapidly than others; that some 
learned more of what was taught and retained what was learned far longer 
than others. It has long been part of the responsibility of teachers not only 
to note and estimate the amount of such differences, but also to make
reports on them and on pupil progress. As new and more refined methods of
measurement were discovered and developed in natural science they gradu- 
ally influenced method and thinking in other fields, including education. 
This development in education is of comparatively recent origin, but it has 
had tremendous effects on learning and teaching and all that concerns the 
school. 

In order to judge the attainment of his pupils accurately and fairly a 
teacher must have accurate measurement techniques at his disposal, must 
know how to use them properly and how to interpret results obtained by 
their use. For example, in order to make satisfactory judgments of a child's 
achievement one must know accurately both his past achievement and his 
probable ability to achieve. A child with low achievement may be doing 
all that he is capable of doing, though actually achieving less than one 
whose accomplishment is much greater but who may not be using nearly 
all of his ability. Again, a pupil may have special abilities or aptitudes that 
result in great irregularity or unevenness in his day-by-day attainment. 
He may do good work in language, history, and literature, but his work in 
arithmetic and science may be distinctly below average. Accurate, depend- 
able measurement is indispensable to interpreting and dealing with such 
situations. And, of course, teachers must make daily, or at least frequent, 
appraisals of the work of every pupil in terms of the objectives set up in 
order to see what progress each pupil is making toward such objectives in 
the light of his abilities. To do this a teacher should have at his disposal 
the widest possible range of dependable, accurate measuring instruments 
and techniques. 

It is important that every teacher, counselor, or school psychologist have 
a thorough knowledge of available measuring devices and techniques as 
they apply to his work, and that the appropriate and proper use of these be 
well understood. Moreover, especially in the case of the teacher, there will 
be many situations in which he will want to devise tests and examinations 
of his own, particularly where none is available that meets the needs of his 
situation accurately. Then he will have use for knowledge and skill in the 
construction of appropriate measurement devices so that the requirements
inherent in his particular situation can be accurately appraised. Measure- 
ment devices or techniques prepared by the teacher are often the best, and 
sometimes the only, means of determining how well classes or individual 
pupils are progressing toward the objectives of instruction. For example, 
when it is desirable to know how well a child is learning to work and play 
with others, or what changes have taken place in his attitudes toward the 
other children, the teacher must often devise his own methods of evaluating 
the child's progress toward such goals. 
Today measurement touches upon and influences every phase of
education. Whether it be marking, promotion, guidance and counseling,
curriculum development, instruction, or some other aspect of the work,
measurement enters the picture and usually plays an important role. However,
while recognizing the pervasiveness and usefulness of measurement in edu- 
cation, the student and teacher must always keep in mind that measurement 
is a tool, a means to an end, not an end in itself. We do not generally
measure anything just for the sake of knowing how long or how heavy it is. 
We have a use for the knowledge, or some reason for acquiring it. We 
measure the dimensions or weight of an object to see whether it fits, whether 
we can carry it, or how much it will cost. Perhaps more important, we 
measure the object to determine whether it will serve a specific purpose. In 
education we measure the capacity (or fit) of an individual so that we can 
help him evaluate his strengths and weaknesses as he develops skills and 
knowledge in acquiring his place in life and society, or we measure to see 
to what degree the purposes, goals, or objectives of education have been 
attained and the extent of the student’s progress toward them. Always, 
measurement in this field should serve as a means of helping to do a better 
job of educating people. 


Learning Exercises


1. Compare methods of measurement in Biblical times with those of the ancient 
Greeks and Romans. Name some important respects in which present-day measure- 
ment is superior to that of earlier times. 

2. List at least ten ways in which people are involved in or use measurement
today. Can you think of any activities in which you engage that are not touched
by measurement or evaluation? 

3. Analyze the responsibilities of a second-grade teacher in the area of
measurement and evaluation. Compare them with those of a teacher of English in the
ninth grade, a teacher of homemaking in senior high school, and a student teacher
in your own major.


EDUCATIONAL MEASUREMENT TODAY 


At the turn of the century standardized tests⁴ as commonly used today
were unknown. The first such school-subject test to be published or made 


⁴ A standardized test is one that has been carefully constructed by experts in the light
of acceptable objectives or purposes; procedures for administering, scoring, and
interpreting scores are specified in detail so that no matter who gives the test or where it may
be given, the results should be comparable; and norms or averages for different age or
grade levels have been pre-determined.
generally available was one in arithmetic by Stone.⁵ By 1920 the first
standardized tests of intelligence and of personality, a number of tests in 
school subjects, some aptitude tests and some general survey tests of school 
achievement had made their appearance. Also, the first books on educa- 
tional measurement and on statistical methods as applied to education were 
published during this period. Some of the significant developments in the 
field of educational measurement during these early years will be described 
in the next chapter, but at this point it will suffice to say that by 1920 the 
use of tests and scales, including rating scales, score cards, check lists, and 
other measuring instruments, had become well established in the schools and 
colleges of the United States. In 1930, Odell⁶ cited data collected several
years prior to publication (perhaps 1925-26) indicating that somewhere be- 
tween thirty and forty million standardized tests were sold in the United 
States during one of these years. Data collected about twenty years later 
show that some sixty million standardized tests were being used in this 
country annually.⁷

Another kind of evidence of the increasing use of tests is the establishment 
of organizations for the development, standardization, and distribution of 
educational measuring instruments. This is an important part of the
business of at least twenty-five organizations in this country, and there are
some seventy-five others which publish one or more tests or other
measuring devices.⁸

Standardized tests of ability, aptitude, achievement, and personality also 
find wide use in business, industry, and the military services. Scarcely any 
large business organization or industrial concern could function today with- 
out a personnel manager or director and staff of trained personnel techni- 
cians or psychologists. Much use is made of standardized and special tests 
in selecting persons for employment, determining what type of work they 
are fitted for, and measuring their efficiency. In addition to educational, 
business, and industrial uses of measurement materials, the armed services 
use large quantities of tests, rating scales, and other measuring instruments. 
Men and women are tested at pre-induction centers and during recruit 
training for selection purposes, for special training, for fitness for unusual 
kinds of duty, for adjustment to the service, and for countless other purposes. 
The first group tests of intelligence, Army Alpha and Army Beta, were de- 
vised especially for use in World War I, and the development and use of all 
kinds of measuring devices were greatly expanded in World War II. 


⁵ C. W. Stone, Arithmetical Abilities and Some Factors Determining Them,
Contributions to Education, No. 19 (New York: Teachers College, Columbia University, 1908).

⁶ C. W. Odell, Educational Measurement in High School (New York: The Century
Company, 1930), 641 pp.

⁷ The American Psychologist, 2:26 (January, 1947).

⁸ O. K. Buros, Fourth Mental Measurements Yearbook (Highland Park, N.J.: The
Gryphon Press, 1953).
It is difficult to imagine how education, business, industry, the military,
or any program of research in these fields could function today without 
measurement. Measurement has assumed the proportions of “big business,” 
both in usefulness and in size. No other development of modern times has 
contributed so much to our understanding of the nature and extent of in- 
dividual differences among boys and girls, men and women. The structure 
and procedures of modern education rest upon the knowledge that individu- 
als differ in every conceivable way, and upon the realization that it is the 
responsibility of the schools to identify these differences, measure them, and 
try to fit the educational program to them. The ideal is an education fitted 
to the individual, and not the individual to a set program of education. 
Progress toward this goal without measurement is at worst impossible and 
at best an intolerable hit-or-miss process. 


ESSENTIAL CHARACTERISTICS OF MEASUREMENT 


According to Webster’s New International Dictionary, to measure is “to 
determine the extent, dimensions, quantity, degree, or capacity” of a 
thing. All of these terms imply a result expressed in numbers rather than 
descriptive phrases. When we measure a thing we express our findings in 
units of length, weight, etc. To say merely that an object is flat or round 
or green or heavy does not satisfy the quantitative aspect of the definition. 
Since measurement is a quantitative process, the results of measurement are 
always expressed in numbers — so many feet long, so many degrees of tem- 
perature, so many quarts or pounds. 

In the second place, measurement is expressed, insofar as possible, in 
constant units. When the unit of length, the yard, was the distance from a 
man’s nose to his outstretched finger tips, it was not a constant unit. Some 
men had longer arms or noses than others. The stone, the English unit of
weight, could be a large stone or a small one. Men soon came to realize 
that such variation in measuring units led to endless trouble and confusion, 
and they eventually reached agreement on certain units. Two systems of 
measurement came into existence: the English system with which we are
all familiar, and the metric system which is used in most European coun- 
tries and in scientific work the world over. Units are constant in both sys- 
tems. An inch is an inch everywhere and a gram is the same in Paris as it is 
in Chicago. Standards such as the standard meter bar in Paris are used to 
calibrate or check other measuring instruments. Varying conditions some- 
times affect the accuracy of the unit measure. An object weighing a pound 
at sea level weighs less on top of a mountain, and a steel rule at 100° C. is
longer than one at 20° C. Therefore, in order to have strictly constant units
we must have constant conditions. However, for all except the most exact
scientific work such fine distinctions are not necessary. Ordinary measure- 
ment requires that the units be only relatively constant. 
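
The effect of temperature on the steel rule can be estimated with a short calculation. This is a minimal sketch that assumes simple linear expansion and a coefficient of about 12 millionths per degree Centigrade, a typical handbook figure for steel that does not come from the text; the function name is likewise hypothetical.

```python
# Assumed typical coefficient of linear expansion for steel, per degree C.
STEEL_ALPHA_PER_DEG_C = 12e-6


def expanded_length(length_at_ref, ref_temp_c, temp_c, alpha=STEEL_ALPHA_PER_DEG_C):
    """Length at temp_c of a rule measuring length_at_ref at ref_temp_c (linear model)."""
    return length_at_ref * (1 + alpha * (temp_c - ref_temp_c))


# A rule exactly 1000 mm long at 20 degrees, heated to 100 degrees.
hot_length_mm = expanded_length(1000.0, 20.0, 100.0)
print(round(hot_length_mm - 1000.0, 2))  # 0.96
```

Under these assumptions the rule grows by roughly one millimeter in a meter, a change that matters in the laboratory but hardly at all to the carpenter.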

The idea of constant units also implies that the unit of measurement is 
exactly the same at all points on the scale. The difference between 95° and 
96° must be the same as the difference between 10° and 11°; a centimeter is 
exactly the same at all places on a single meter stick and on all meter sticks. 
In other words, the unit of measurement does not vary in different parts of 
the scale. 

These two characteristics, namely, quantitativeness and constancy of 
units, are fundamental to all measurement. It must be recognized, of 
course, that the degree to which they are attained varies from one field or 
area to another. They can be obtained in physical measurements, for ex- 
ample, to a higher degree than in the measurement of mental or emotional 
traits, at least in the present stage of development of measuring techniques. 
It is also well to keep in mind that the degree of constancy of units is governed 
in large part by the situation. Constancy is often a relative term. There is 
less need for exact equivalence of units on a carpenter’s rule or a household 
thermometer than on the extremely fine instruments used in laboratory 
research. 


Learning Exercises


4. Define the terms quantitative and qualitative, as used, for example, in
chemistry. Apply the distinctions thus expressed to educational measurement.

5. What is implied by the phrase constant unit? Give some illustrations of the
concept as it applies in educational measurement.


ERROR IN MEASUREMENT 


Occasionally everybody makes mistakes in arithmetic or in reading a 
scale or measuring the dimensions of a room. However, the concept of error 
in measurement has a somewhat different connotation. Suppose one were
conducting an experiment which required the recording of temperatures. 
One might read the thermometer as accurately as possible and follow all 
directions closely, and yet there would be some degree of error in the results. 
Why would this be so? To answer this question let us consider the nature 
and sources of error in measurement. They are chiefly of three types. 

The first is the error of observation, sometimes referred to as the human 
equation. This type of error has not always been recognized. It is said that
one observer in a world-famous astronomical observatory back in the 1800's
was discharged because his observations consistently differed from those of
his co-workers. It was not known then that such differences, though pos- 
sibly the result of carelessness on the observer's part, were often due to 
differences in how the independent observers actually "saw" the instru- 
ment readings. It is now a well-known fact that even highly trained ob- 
servers may observe the same phenomenon at the same time and yet differ 
in their reading of a scale or in their description of what took place. More- 
over, a single observer's own readings will commonly be found to vary from 
one observation to the next, even though the actual conditions are un- 
changed. 

The second source of error is inherent in the measuring device or instru- 
ment. Variations from one instrument to another — slight and perhaps im- 
perceptible variations in units of the scale, and similar mechanical variations 
— result in measuring instruments which are something less than infallible. 
In spite of all the painstaking care with which scientific measuring tools are
made and calibrated, there is none that is perfect. However, the more care- 
fully the device is made and the better the materials used, the smaller the 
amount of error is likely to be. This source of error is especially significant 
in educational testing devices for reasons that will be discussed later. 

The third source of error in measurement stems from lack of uniformity
in what is being measured. Whether one is measuring the strength of a 
piece of twine or the performance of a child in arithmetic, some degree of 
uniformity is vital to accurate measurement. The strength of the twine 
varies in different segments of the samples and according to age, moisture, 
and other factors; the behavior of the child varies somewhat according to 
his motivation, the physical conditions of the room, and perhaps his health 
and mood at the time. For practical reasons it is generally impossible to 
measure all of a product or material such as a carload of ore, or all of an
individual's behavior or knowledge under all conditions. Therefore, an
attempt is always made to measure a representative or typical sample of the
material or behavior. 

A knowledge and understanding of the possible sources of error is essential 
to an intelligent use of measuring instruments in any field, and through such 
knowledge and understanding we improve our methods of measuring. We 
can also make more intelligent use of the results of measurement, for
knowing the limits of accuracy is very helpful in making proper interpretations.

6. Read a thermometer, estimating the reading to the nearest tenth of a degree.
Record your result but do not show it to anyone. Have someone else (or several other 
persons, if possible) make a reading immediately, and write down his result. 
When finished, compare the results.
7. Compare the readings of two thermometers, two balances, two yardsticks, or
other pairs of measuring instruments. Do they agree? How closely? 


THE NATURE OF EDUCATIONAL MEASUREMENT 


In the light of the foregoing discussion of some of the chief characteristics 
of measurement in general, it might be well to consider how they apply or 
do not apply to measurement in education. 

First, it may be said that measurement in education is quantitative, 
otherwise it cannot properly be called measurement. By use of educational 
measurement we get scores, norms, I.Q.'s, averages, etc., all of which are
numerical expressions. Not all methods of appraisal or evaluation in edu- 
cation are quantitative, but those which are not cannot properly be classified 
as measurement. 

Second, in the development of educational measuring devices substantial 
progress has been made toward constancy of units. This aspect of the 
problem cannot be discussed without getting into technical matters which 
are out of place in this chapter. It may be said, however, that the develop- 
ment of certain types of derived or transmuted scores such as T-scores or 
standard scores represents at least an approach to the establishment of 
constant units of educational measurement. It should be said again, how- 
ever, that “constant” in this case is a relative matter and that there are 
few really constant units of measurement in any field. 
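
As a rough illustration of what a derived score does, the sketch below converts raw scores to standard scores (distance from the group mean in standard-deviation units) and then to T-scores. It assumes the simple linear form of the T-score with a mean of 50 and a standard deviation of 10, a convention this chapter does not spell out; the raw scores and the function names are invented for the example.

```python
from statistics import mean, pstdev

def standard_scores(raw_scores):
    """Express each raw score as its distance from the group mean in SD units."""
    group_mean = mean(raw_scores)
    group_sd = pstdev(raw_scores)  # SD of the reference group itself
    if group_sd == 0:
        raise ValueError("all scores are identical; there is no spread to scale by")
    return [(x - group_mean) / group_sd for x in raw_scores]

def t_scores(raw_scores, t_mean=50.0, t_sd=10.0):
    """Rescale standard scores to an assumed mean of 50 and SD of 10."""
    return [t_mean + t_sd * z for z in standard_scores(raw_scores)]

# Hypothetical raw scores for a small reference group.
raw = [30, 35, 40, 45, 50]
for x, t in zip(raw, t_scores(raw)):
    print(x, round(t, 1))
```

Whatever the raw-score scale of the original test, the derived scores share the same mean and spread, which is the sense in which such scores approach a constant unit.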

Third, error is present in educational measurement as it is in all fields of 
scientific research. Yet no sensible person would advocate discontinuing 
measurement in astronomy, physics, or even in biology or psychology be- 
cause error of measurement is known to be present. Instead, the scientist 
determines the causes of error and tries to eliminate them; knowing that 
he cannot do this entirely or completely he tries to determine the amount 
of error or the degree of accuracy in his measurements. When this has been 
determined he proceeds to use measurement to the best advantage with 
full recognition of the limits of accuracy of his results. 

As we gain knowledge and experience in a field, our measurement tech- 
niques improve, the margin of error decreases, and the results become more 
exact. Also, as workers in any subject area learn more about measurement 
they develop an attitude of suspended judgment and caution which helps 
them to avoid rash statements or conclusions not justified by the data or 
the degree of accuracy of their measurements. Moreover, knowledge of 
the probable limits of the error of their measurements makes it possible to 
specify the degree of accuracy quite closely. Instead of saying that a child's 
I.Q. is 115 one learns that it is more accurate and just as useful to say that
it is highly probable that his I.Q. is between 105 and 125, or that there is a
50-50 probability that his I.Q. is between 110 and 120. This may seem like
rather rough or approximate measurement, and compared with results in 
some other fields, it is. On the other hand, even with that degree of possible 
error the results are still much more accurate than any other known methods 
of estimating the intelligence of children, and what is perhaps equally im- 
portant, the degree of accuracy or probable limits of error are known. 
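
The arithmetic behind such a statement can be sketched as follows. The sketch assumes that errors of measurement follow the normal curve and that the band with a 50-50 chance corresponds to one "probable error" on either side of the obtained score; the value of 5 points is inferred from the 110 to 120 band quoted above, and the function names are hypothetical.

```python
from statistics import NormalDist

# Number of standard deviations that encloses the middle 50 per cent
# of a normal curve (about 0.6745); one probable error is this many SDs.
PE_IN_SD_UNITS = NormalDist().inv_cdf(0.75)

def score_band(observed, probable_error, multiples=1):
    """Return the (low, high) band lying `multiples` probable errors either side."""
    half_width = multiples * probable_error
    return observed - half_width, observed + half_width

def chance_within(multiples):
    """Chance that the true value lies within the band, assuming normal error."""
    z = multiples * PE_IN_SD_UNITS
    return 2 * NormalDist().cdf(z) - 1

iq, pe = 115, 5  # pe = 5 is an assumption inferred from the text's example
for k in (1, 2):
    low, high = score_band(iq, pe, k)
    print(f"{low} to {high}: chance about {chance_within(k):.2f}")
```

Under these assumptions the narrower band carries an even chance, and the wider band of 105 to 125 carries a chance of roughly 82 in 100, which is one way of reading the phrase "highly probable."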

Fourth, educational measurement is generally indirect rather than direct.
The weight of an object can be determined directly in pounds and ounces
or other units by use of a balance or scale. By contrast, educational 
measurement is indirect. We do not measure such traits as intelligence or 
mechanical aptitude directly, but rather by inference. As an individual is 
able to perform designated tasks, we are able to draw from the results of his 
performance certain conclusions about his intelligence or aptitude. The 
same is true with measurements in the fields of school achievement or per- 
sonality or interests. In these fields the pupil’s knowledge, adjustment, 
or motivation are measured indirectly by inference based on his behavior 
and especially his performance on tests. 

Fifth, educational measurements are relative; they are not in any sense 
absolute. There is no unit of achievement in arithmetic, no unit of aptitude 
in music, no unit of school intelligence which is comparable to absolute zero, 
Centigrade, or the time of the earth’s rotation. Standards in educational 
measurement are based on observed performance of typical subjects. The 
evaluation of the performance of an individual or a group is made by com- 
paring it with that of a typical group. To be more specific, performance on 
educational and other tests is interpreted in terms of norms which are 
simply averages of typical groups on the test in question. A child’s score 
on a spelling test may be 40, which has no immediately clear meaning com- 
parable to 40 minutes or 40° Centigrade. However, if we know his age and 
we find that the average score of children of the same age on this test is 35 
his score takes on meaning. Furthermore, if we know that 15 per cent of 
the children of his age make scores above 40 his score takes on added 
meaning. It is for these reasons that measurement in education is said to 
be both indirect and relative. 
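
The spelling-test illustration can be worked through in a few lines. The norm group below is invented so that its average is 35 and 15 per cent of its members score above 40, matching the figures in the text; the function name and the data are hypothetical, and a real norm group would be far larger.

```python
from statistics import mean

def interpret(score, norm_scores):
    """Describe a raw score relative to a norm group of the same age."""
    norm_average = mean(norm_scores)
    higher = sum(1 for s in norm_scores if s > score)
    return {
        "norm_average": norm_average,
        "difference_from_average": score - norm_average,
        "per_cent_scoring_higher": 100 * higher / len(norm_scores),
    }

# Hypothetical scores of 20 children of the same age on the same test.
norm_group = [21, 26, 28, 29, 30, 31, 32, 33, 34, 35,
              36, 37, 38, 38, 39, 40, 40, 42, 45, 46]

report = interpret(40, norm_group)
for label, value in report.items():
    print(label, value)
```

The raw score of 40 says nothing by itself; it acquires meaning only when set beside the average of the group and the proportion of the group that surpasses it.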

It might be added that educational measurement is also necessarily rela- 
tive because there is no established zero point. It is possible to start with 
no weight or no volume in physics, but, as Thorndike* put it years ago, no
one has ever determined the point of “just no intelligence.” Without such
a zero point it is necessary to express results of educational measurement in
terms of some other frame of reference, such as an average. The performance
of an individual must be evaluated by a comparison with an average rather
than by its relationship to “zero.”

* E. L. Thorndike, et al., The Measurement of Intelligence (New York: Teachers College,
Columbia University, 1927).

From the foregoing discussion it may be concluded that measurement in 
education is faced with many of the same problems and difficulties as meas- 
urement in other fields. However, it may seem that these difficulties are 
greater in education than in the more exact fields. One of the main reasons 
for this is the fact that the materials being measured — human beings — are 
constantly changing and sometimes difficult to control. A chemist, on the
other hand, is able to handle most of his samples in any way he chooses.
Most of them remain constant and uniform, and can be divided or mixed. 
With children such uniformity of sample and control of conditions is ex- 
tremely difficult to attain. And yet great progress has been made in a 
relatively short period. We do measure human beings in many ways al- 
ready, and the biologist is able to control and measure animals and plants 
— organisms that grow and change also. No one would say that measure- 
ment in biological science is impractical or useless. The biologist leans 
heavily upon quantitative methods for research and for practical work. 
Yet he deals with living, growing, changing organisms — a guinea pig or a 
cow or a tomato plant — which differ from boys and girls, as far as meas- 
urement is concerned, only in degree of complexity and susceptibility to 
control. 

It should be said again that measurement is only a tool. It is a means to
an end. Yet it is valuable to the extent that it helps teachers, counselors, 
administrators, and others connected with the schools to do a better job of 
educating children and adults. Few would question that measurement has 
done much to help appraise what we do in education, to take education out 
of the realm of opinion and provide many valuable facts, and to point out 
ways in which the job can be done better. 

Much more remains to be done and can be done toward improving our 
existing instruments and techniques and helping teachers and others to
make more effective use of what is already available. Many difficulties in 
educational measurement have already been overcome, and new progress is 
being made every day. Those obstacles and imperfections that remain 
should be regarded as a great challenge to the ingenuity, resourcefulness,
and competence of those who work in this field.

8. What are some common sources of error and difficulties in educational
measurement? What can be done about them?
9. Compare measurement in education with measurement in physics. With
measurement in biology. What are the important similarities? Differences?
10. What attitude should a teacher or counselor take toward the problems and
difficulties of educational measurements? 


TESTING — MEASUREMENT — EVALUATION 


The three terms mentioned above are widely used, sometimes
interchangeably, and are the source of some confusion. Testing, as seems
obvious, means the use of tests. It may mean testing the strength of ma- 
terials, as in the case of textiles; it may mean testing a class in arithmetic, 
or it may mean testing an individual's intelligence. It usually involves the 
use of some specific instrument or set of instruments to determine a certain 
quality or trait, or a series of such qualities or traits. For example, one may 
use a test battery to measure achievement in a variety of school subjects. 
As generally used in education, the term testing has come to have a rather 
specific and somewhat limited connotation and, in some instances, a slightly 
unfavorable one. A tester is regarded by some persons (whether rightly or 
wrongly) as a technician who is more interested in the scores and statistics 
of the results than in what the results mean in relation to the boys and girls 
who made them. There is probably some justification for this attitude on
the part of teachers if, for example, the tests are given to their pupils by ad- 
ministrative order, and particularly if their pupils do not do well on the 
tests. Often the result is that tests get a bad name undeservedly, since the 
fault usually lies in the way they were used. 

Measurement is usually conceived of somewhat more broadly than testing. 
It is thought of as including a greater variety of instruments than testing. 
Rating scales, check lists or score cards, and any devices which yield or can 
be made to yield quantitative results may be regarded as measuring in- 
struments. Also, measurement often implies a somewhat broader interpre- 
tation of results than testing, though this difference is very difficult to define. 
It might be said that a measurement program generally is thought of as 
having broader and more pupil-centered objectives than a testing program. 

Evaluation is conceived of as being the broadest of the three. It generally 
includes and often uses predominantly qualitative as well as quantitative 
instruments. That is, an evaluation program makes use of such devices and 
methods as anecdotal records, observation of children without any special 
attempt to make such observations quantitative, children's work samples, 
and the like. Such methods are not actually measurement as it has been 
defined above. They are dependent upon and largely limited to qualitative 
judgments, descriptive accounts, and opinions. An evaluation program 
may and often does include quantitative methods such as tests and scales, 
but it is not limited to these.

The term evaluation has found much favor at the elementary level where
conditions for use of anecdotal records, qualitative judgments, etc., are 
more favorable. An outstanding characteristic of evaluative techniques is 
that they are very time-consuming. If a teacher is to keep his records in a 
way that will give meaningful and fairly reliable results, a great deal of 
his time will be required. A teacher of an elementary grade with the same 
35 pupils in his class daily for a year is in a much better position to make 
anecdotal records, to chart participation, observe behavior, and examine 
work samples than a high school teacher of mathematics or of English who 
meets as many as 150 or more different pupils for five periods per week. 

The distinctions between the three terms may be more apparent than 
real. There is little doubt that testing programs have often been narrowly
conceived. There can be little doubt, also, that tests have often been 
given and nothing done with the results. On the other hand, it seems 
equally true that the term evaluation is sometimes thought of as some 
magical and simple process which, without much effort or work on the part 
of anyone, provides answers to the most difficult and baffling educational 
problems. One important and basic principle should underlie all such pro- 
grams or activities: the instrument or technique should be chosen to fit the 
objective to be measured or evaluated. Whether the process be called 
testing, measurement, or evaluation is not nearly so important as it is that
the progress or status of the learner with respect to the desired goal is being 
determined. A second principle of equal importance is that no technique 
is worth using unless the results it yields can be depended upon in every 
sense. An evaluation procedure, like any test, is useful only to the extent
that it yields data which are accurate and which mean what they seem, or 
are believed to mean. 

As a greater number and variety of tests, scales, etc., have been produced, 
the importance of using different kinds of instruments and relating data 
from one to another has become more and more evident. The broad concept
of evaluation has contributed materially to the development of this point 
of view. It has emphasized the need for a great variety of measures or sam- 
ples of an individual's behavior and it has stressed the interrelatedness of 
such information in understanding and helping the individual. Evaluation 
has also called attention to the importance of traits or qualities or conditions 
not easily measured by objective tests. 

The term measurement is used throughout to express the area or field with 
which this book is concerned. The instruments commonly used in education 
today, including tests, scales, inventories, and rating scales, are all considered and
discussed. The major share of space is given to tests since they greatly exceed
in number and variety all other types of instruments. Some attention is 
given to evaluation procedures also, but no complete treatment is attempted
because such procedures do not fall within the category of measuring
instruments, and also because they have increased in number to the point
where such an attempt would go beyond the scope of this book.

11. What distinctions do you make between the terms testing, measurement, and
evaluation? What justification can you give for such distinctions?
12. The expression has been used in some comparisons that “differences are of
‘degree’ rather than ‘kind.’” What does this mean? Does it have any application
to the question of distinctions between the three terms in Exercise 11? To what was
said earlier about the terms quantitative and qualitative?


PURPOSES OF THIS BOOK 


Basically, the purpose of this book is orientation. It is presumed that 
most students in a first course in educational measurement have had little 
or no systematic presentation of the principles and practices in this field. 
Consequently, the book assumes no background other than the usual intro- 
ductory courses in education required of those preparing to teach. It 
does assume a professional attitude and a willingness to work and learn in a 
field that is perhaps more technical than most undergraduate courses in 
education. 

Another important purpose of this book is to assist teachers and others
who devise their own tests and evaluative devices. Every teacher makes
tests and examinations of his own. This is a part of the job of teaching, and
a necessary and important part. It is essential that this be done as well as
possible, if only to insure that the least possible injustice be done to the
individual pupil. More positively, adequate measurement skills are
important because without them it is impossible to determine whether or not we
are making progress toward our educational goals. If we are making
progress we must know how much progress is being made, not only by groups
but also by individuals. The skill of the teacher, counselor, or school
psychologist in devising and using measuring instruments plays an
important role. Therefore, one of the major purposes of this book is to provide
the principles and the “know-how” so that those who have the responsibility
of making examinations and other measuring instruments will be helped to
do this better.


A further purpose of the book is to present in an elementary way the
tools and techniques for the intelligent use and interpretation of the results
of standardized and other measurements. Moreover, it is the particular
aim of this book to show how such results can be put to practical use in the
school for the benefit of boys and girls and for the improvement of the edu- 
cational process. Too often a testing program is undertaken with great 
enthusiasm, and considerable time and money are given to it — only to 
have the tests selected, administered, and scored, and then filed away or tied 
up neatly in bundles, scarcely to be looked at again. Unless the results of 
tests are put to use the testing process becomes wasteful and pointless.

In order to use tests and measurements effectively, it is necessary to know 
what materials are available. This book does not pretend to make an ex- 
haustive survey of all available educational measuring instruments since 
such a survey would be quite impractical. However, it does include de- 
scriptions and discussion of methods and devices in the major areas such as 
achievement, intelligence, and personality. Again, let it be understood that 
no attempt is made to describe or even list all available tests in such sub- 
jects or areas. The purpose here is simply to describe prototypes or typical 
examples so that the beginner in this field will be able to gain some
knowledge and understanding of the kinds of instruments that have been
developed and found useful.


The Development of Educational Measurement


EARLY BEGINNINGS — BEFORE 1900 


It may be assumed that teachers have always measured or evaluated the 
work of their pupils. Evidence of early records indicates that this was gen- 
erally done through personal observation, oral questioning, and subjective 
judgment by the teacher. However, some responsibility for evaluating the 
progress of pupils has traditionally been shared by citizens other than the 
teachers. It has been customary, in this country at least, to have a school 
committee of lay citizens in each community who would be responsible for 
the local schools. From such committees have evolved our present-day
school boards. One of the functions of the early school committees was to 
visit the schools in their communities or districts at least once a year for 
inspectional purposes. During these inspections it was customary for 
the members of the committee to examine the pupils by asking them
questions.


The Boston Survey 

The report of a school committee which visited the English High School
in Boston in 1845 indicates that the members of the committee examined the
pupils in algebra, geometry, and French, and reported that “the public have
no reason to be dissatisfied with its (the school's) present condition.”¹

A few years before this time Horace Mann had been appointed Secretary
of the Massachusetts State Board of Education. He soon was going about
the state pointing out weaknesses as he observed them in the public
(“common”) schools. Naturally, the schoolmasters and the local school
committees resented his criticisms, and some thirty teachers and committee
members in Boston banded together for the purpose of resisting and refuting
him. The upshot of the quarrel was an agreement to prepare written
examination questions in history, arithmetic, geography, definitions (vocabulary),
grammar, natural philosophy (science), and astronomy to be answered by
the pupils. A total of 154 questions were prepared and these were answered
in whole or in part by 530 pupils selected from a total of 7,526. It is said
that this group of 530 pupils represented “the flower of the Boston Public
Schools.”² The average age of the pupils examined was thirteen years and
six months.

¹ Otis W. Caldwell and Stuart A. Courtis, Then and Now in Education, 1845-1923
(Yonkers-on-Hudson, N.Y.: World Book Company, 1925), p. 4.

Below are some typical questions, chosen at random, from this examina- 
tion: 


(a) What do you understand by the Norman Conquest? 

(b) What is the square root of $ of $ of 4 of $? 

(c) Name the principal lakes in North America. 

(d) Define “Monody.”

(e) What is the difference between an active and a neuter verb? 
(f) Explain the hydrostatic press. 

(g) What causes an eclipse of the sun? 


Giving the same written examinations to a sample of all pupils at the 
same school level in Boston was a novel procedure. Indeed, it appears that 
this was the first recorded instance of such a survey anywhere.

The results were eagerly awaited. We are told that the committee scored 
the papers under uniform conditions and tabulated the answers “question
by question and school by school.”³

The results fully justified the criticisms by Mann and were a keen
disappointment to the school committee. They revealed great inequalities among
schools, and startling ignorance on the part of many pupils. We are told
that Mann did not take advantage of his opportunity to ridicule or castigate
his critics, but retained an impeccable professional attitude in commenting
upon the findings. He did recognize the value of the method employed in
examining the pupils, though it was to be almost half a century before this
method of evaluation would again be the focus of real interest among
educators.

Although there seems to be no record of anything like the Boston Survey
in the United States for nearly fifty years thereafter, we are told of an
English schoolmaster, one Reverend George Fisher, who reported having
constructed what he called a Scale Book. The account of this is given in an
article published in 1864 by E. B. Chadwick.⁴ In his Scale Book the
Reverend Mr. Fisher included a scale of handwriting against which samples
of children's handwriting could be graded, a standard list of spelling words,
and questions in mathematics, navigation, Scripture knowledge, grammar
and composition, French, general history, drawing, and practical science.
Thus he provided examinations by which any pupil could be tested and
graded, not only in each subject or area, but also on the total and the
average of all, or any combination of, subjects.

² Ibid., p. 171.
³ Ibid., p. 7.
⁴ E. B. Chadwick, “Statistics of Educational Results,” The Museum, a Quarterly
Magazine of Education, Literature and Science, 3:429-84 (January, 1864). Original not seen.
Taken from a report entitled “Educational Measurements of Fifty Years Ago,” based
on a communication from E. L. Thorndike, Journal of Educational Psychology, 4:551
(November, 1913).

The work of Fisher, as in the case of Mann and the Boston School Com- 
mittee, made no great impression or had no immediately discernible or last- 
ing effect on practice in the schools, either in England or in the United 
States. 


Measurement of Individual Differences 


The next important figure in the development of measurement in educa- 
tion was an English scientist, Sir Francis Galton (1822-1911). He was 
one of the first to sense the implications of the fact that individuals differ
intellectually and emotionally as well as physically. It is hard to believe 
today that the schoolmaster of colonial times had so little appreciation or 
understanding of individual differences in children. If one child did not 
learn as easily and as well as another the difference was explained on the 
basis of laziness, and the way to cure that was by corporal punishment. 
Galton's work was very influential in changing such ideas. He demonstrated 
both by ingenious tests and by statistical methods that individuals differ 
in physical, mental, and social traits. He also laid the groundwork for 
modern statistical methods, without which progress in educational measure- 
ment, and particularly in standardized tests, would have been impossible. 

An American psychologist, James McKeen Cattell, contributed a great
deal to the measurement movement in the United States toward the end of 
the nineteenth century. He became intensely interested in the problem 
of individual differences and made a number of experiments in
sensory-motor abilities. For example, he developed a variety of simple tests to
measure the length of time it takes a given individual to press down a 
telegraph key after a light flashes, the rate of tapping, keenness of hearing 
and vision, etc. Cattell worked on the theory that differences in sensory 
keenness, speed of reaction, and similar abilities or traits would reflect
differences in intelligence. The results were disappointing in that no 
clear relationship was found between scores on such tests and intelligence 
as judged by success in school or college. He is credited, however, with be- 
ing one of the pioneers in the measurement movement and the first to use the 
term “mental tests.” Cattell and Galton both were more interested in
measuring intelligence than in school achievement, but their ideas and work 
had significant and lasting influence on developments in the entire field of 
educational measurement.

In 1895, almost fifty years after the Boston experiment, J. M. Rice,⁵ in an
endeavor to determine what teachers in the schools were actually accom- 
plishing, made up a list of 50 spelling words to be used as a test. These were 
given to more than 16,000 pupils in Grades 4 to 8. The results showed such 
wide variation that Rice devised another test in which the spelling words 
were used in sentences; more than 13,000 children were examined under 
his personal direction. Finally, a further test was devised by Rice whereby 
a story, accompanied by a picture, was read to the pupils, after which they 
were asked to write a composition about it. Their papers were then checked 
for spelling errors. Although our interest here is in the measurement aspects
of his researches, it is interesting to note that Rice found great variation 
from class to class, school to school, and city to city, regardless of such 
factors as time devoted to study, location of the school, and efficiency of 
the teacher. 

Rice conducted similar studies with tests of his own in arithmetic and 
language over a period of nearly a decade. Because of the scientific objec- 
tivity of his approach, and his skill in devising measuring instruments, Rice 
stands as a pioneer in the field of measurement, though the significance of 
his contributions is not always fully appreciated. 

The work of men like Mann, Galton, Cattell, Rice, and others, the grad- 
ual application and adaptation of scientific methods of measurement in 
the social sciences, the great increases in school and college enrollment, and 
the development of a body of knowledge and principles for the training of 
teachers — all these, as well as other factors, combined to make the times 
ripe for new methods of measurement in schools and colleges toward the 
beginning of the twentieth century. 

It is impossible to say which of these factors were causes and which were 
effects. Certainly the greatly increased enrollments rendered almost impos- 
sible the oral and individual examining of the early days. Now that the 
typical secondary school teacher meets 150 pupils a day instead of 50, the 
older, more personalized methods of appraisal have had to be abandoned. 
Indeed, it is unusual if today's high school teacher is able to learn the 
names of all his pupils before he gets a new group. 

The scientific movement has influenced education as it has everything 
else in our world today. It has given us new techniques of measurement 
and appraisal to meet changing conditions in the schools. New practices in 
measurement have affected curriculum and methods, and changes in these 
have in turn encouraged improvements in measurement. Today, teaching 
and measurement are complementary, interdependent, and almost insep- 
arable. 


⁵ J. M. Rice, Scientific Management in Education (New York: Hinds, Noble, and
Eldredge, 1914), Chaps. 5-11.

1. What ideas or procedures used in the Boston Survey are exemplified in
present-day educational measurement?

2. Name one contribution to educational measurement of each of the following: 
Galton, Cattell, Rice. 


FROM 1900 TO WORLD WAR I 


The field of measurement was dominated and shaped by the work of a 
few great minds in the first two decades of this century. One of the most 
outstanding contributors to the field was Edward Lee Thorndike. Arriving 
at Columbia University just before the close of the 19th century, he soon 
gained a position of leadership which he held for more than thirty years 
in the field of educational psychology and measurement. His first important 
contribution to the measurement field was a book on statistical methods.⁶
Though no longer in use today as a textbook, it was the first of its kind. In 
this book Thorndike presented a compendium of available knowledge on 
the uses of statistical methods as applied to the measurement of human 
abilities and traits. It is perhaps significant that no other book on the 
subject appeared for more than ten years. 

As in the case of Galton and of Cattell, Thorndike’s interest in measurement
grew out of his appreciation of the significance of individual differences.
Besides this interest, he possessed to an outstanding degree the
ability to see clearly the implications and essentials of a problem or situa- 
tion, and a wealth of ideas for designing experiments and tests. Thorndike 
produced a number of tests and scales, including a scale for measuring qual- 
ity of handwriting, another for measuring quality of drawings, an intelli- 
gence test for use at the high school level, and several other tests. In addi- 
tion to these contributions and his many articles and books on measurement, 
Thorndike inspired and guided many graduate students to significant con- 
tributions in the field. Among the contributions made by his students were 
an arithmetic test devised by Stone in 1908,⁷ the Hillegas scale for measuring
quality in English composition,⁸ and Buckingham’s spelling scale.⁹

6 Edward L. Thorndike, An Introduction to the Theory of Mental and Social Measurements
(New York: The Science Press, 1904), 212 pp.

7 C. W. Stone, Arithmetical Abilities and Some Factors Determining Them, Contributions
to Education, No. 19 (New York: Teachers College, Columbia University, 1908).

8 M. B. Hillegas, “A Scale for the Measurement of Quality in English Composition by
Young People,” Teachers College Record, 13:331-84 (September, 1912).

9 B. R. Buckingham, Spelling Ability: Its Measurement and Distribution, Contributions
to Education, No. 59 (New York: Teachers College, Columbia University, 1913).
The First Successful Intelligence Test


It was during this same period that one of the most significant contributions
in intelligence testing was made by a French psychologist, Alfred
Binet. Binet had become interested in mental measurement through his 
work with children in the schools of Paris. He had observed the great in- 
dividual differences in mental acuity or learning ability existing among 
these children, and he believed that it should be possible to devise some 
easily administered tests that would give accurate data on the amount of 
these differences. He became especially interested in finding some simple, 
rapid, and precise method of identifying mentally retarded children and
measuring the degree of such retardation. Although many psychologists, 
both in this country and abroad, were interested in the measurement of in- 
telligence, none had succeeded in devising adequate measurement techniques 
or methods. 

After much study and experimenting, Binet and his assistant, Théodore 
Simon, published an article in 1905 in which they presented a series of tests 
to measure the level of mental development in children. The article 
aroused world-wide interest, and many workers at once began experimenting 
with Binet's tests and corresponding with him about them. In 1908 Binet 
and Simon published an improved version of their scale, and in 1911 Binet 
published another revision. Unfortunately, he died soon thereafter. 

In another chapter there is a more adequate discussion of the theoretical
considerations underlying the work of Binet and some of its implications for
mental measurement. At this point it is enough to say that his work constituted
one of the most important milestones in the development of mental
tests. His was the first successful method for measuring intelligence and 
expressing individual differences in accurate, quantitative terms. Indeed, 
Binet's method and materials for the measurement of intelligence form the 
basis of the general approach in use today. 


Tests of Personality and Character 


While the work in measuring school achievement and intelligence was 
going on, early attempts toward the measurement of emotions, interests, 
and personality were also being made. Although many writers suggest that 
such endeavors came later than those in the measurement of intelligence 
and achievement, this apparently is not the case. There is evidence that 
Galton used rating scales as early as 1883. There is also some indication 
that a rating device was known of and used, in much the same manner as 
rating scales are used today, even earlier than 1883.¹⁰


10 Douglas G. Ellson and Elizabeth Cox Ellson, “Historical Note on the Rating Scale,”
Psychological Bulletin, 50:383-84 (September, 1953).
The so-called “free association” tests have also been used for nearly half
a century. In these, the person tested is presented with a number of 
stimulus words, to each of which he responds by naming the first word that 
comes to mind. His responses are analyzed for emotional coloring and also 
for variations in the speed of response. That is, if the response to the word 
teacher is “kind” or “nice,” it reveals something different than if the re- 
sponse is “ugly” or “mean.” Also, a response which comes slowly or hesitantly
may indicate an emotional block which is perhaps significant, and
which is probably not present in the subject who responds in the normal 
amount of time. 

Other methods which, though unreliable, have been used for some time 
in judging personality are based on physical and motor attributes. For 
example, elaborate systems of evaluating personality and intellect on 
the basis of body size and proportions have been devised. Many persons, 
among them some psychologists, are intrigued by the idea of using hand- 
writing as a measure of personality, and few persons can say that they never 
judge an individual’s personality by his face. It is but a short step from 
such systems to the pseudo sciences of palmistry and phrenology. None of 
these systems has ever become established or accepted by reputable psychol- 
ogists, at least in the United States, for the advocates of these questionable 
methods have never succeeded in demonstrating that their methods can 
produce reliable and valid results. In fact, when such methods have been
subjected to objective, empirical tests the results have been consistently 
disappointing. 

By 1915 the basic principles and techniques of educational and psychological
measurement were beginning to become established. The fundamentals
of statistical method were known, at least to the leaders in this field, and
would soon be widely disseminated. Some important pioneering had been
done, notably by a few outstanding men such as Thorndike and Binet. The
stage was set for a new era in the development of testing, and as the United
States entered World War I in 1917 real progress was being made.
3. Thorndike wrote, “Whatever exists at all exists in some amount. To know it
thoroughly involves knowing its quantity as well as its quality.” (See page 16 of
The Measurement of Educational Products, listed in the annotated bibliography at
the end of this chapter.) What are the implications of this statement for educational
measurement?

4. Binet at one time studied medicine, but became a psychologist. Of what value
might his medical training have been to him in his great work of measuring intelligence?
Might it have been a handicap in some respects? If so, how?
5. Graphology, phrenology, and palmistry have certain objectives and techniques
in common. What are they? How would you test the validity of claims made 
for these systems? 


FROM WORLD WAR I TO 1930 


The Army Tests 


When the United States entered the war in 1917, it soon became apparent 
that methods in use up to that time for appraising and classifying men were 
hopelessly inadequate. Hundreds of thousands of men were being drafted 
and within the short time of a month or two they had to be examined and 
assigned to duty. No instruments were known which would serve the pur- 
pose quickly and accurately. The Binet scale and its American revisions 
were individual examinations which required the complete time and atten- 
tion of a trained examiner for an hour or more to test one man. Obviously, 
this was too slow. Out of the need for a more expedient procedure came 
Army Alpha, the first group test of intelligence. The War Department 
requested a number of prominent psychologists to produce such a test, and 
this was accomplished within the remarkably brief time of a few months. 

The committee was fortunate in having placed at its disposal without 
reservation the work of Arthur S. Otis. Dr. Otis had made considerable 
progress in the development of a group test of intelligence, and it was 
largely his work which became the basis for Army Alpha. Army Alpha was 
a verbal test requiring approximately sixth-grade reading ability. Since 
many of the drafted men could not read at that level, some not at all, and 
others could not write, read, or speak English, another group test, Army 
Beta, was devised. This required no reading, and even the directions could
be given in pantomime. Nearly two million men were tested with one or
the other of these examinations in 1917 and 1918.

At the same time, rating scales were devised for the army's use in classifying
officers and men according to various qualities, and a personality test
known as the Personal Data Sheet was devised by R. S. Woodworth for use
by the army in identifying and studying neurotic draftees.

Some attempts were also made by the armed forces to develop tests for 
determining mechanical, clerical, and various other aptitudes, though these 
efforts did not receive the same attention that the other types of tests did, 
and the results were therefore less satisfactory.

The work in test development for the armed forces in World War I gave
tremendous impetus to the development and use of tests in the schools.
After the war, Army Alpha and Army Beta were released for general use and
given to many thousands of high school and college students. Within a few
years a number of group tests of intelligence appeared, some patterned
quite closely after the army test, others showing much originality. Among
these new tests were the Otis Group Intelligence Scale,¹¹ the Miller Mental
Abilities Test,¹² and the Terman Group Test.¹³


Important Early Publications on Measurement 


During this period numerous books and monographs on educational 
measurement also began to appear. The first such book was published in 
1916.¹⁴ This was quickly followed by a more comprehensive volume in
1917.¹⁵ At the same time other books on statistical methods in education
began to appear. The first of these, by Rugg,¹⁶ was published in 1917.

In 1918 there appeared a report that presented a stock-taking of the
accomplishments of the early period and some predictions for the future.¹⁷
This volume, assembled by leaders in the field, with each one contributing a
chapter dealing with some significant area or problem (such as the uses of
measurement in the schools, existing tests and standards, etc.), was probably
the most significant publication in educational measurement up to
that time.


Survey Tests 


Another important contribution of this period was the development of 
survey tests. Such tests consist of batteries of achievement tests in several 
common branches of instruction, particularly language, arithmetic, social 
studies, and science. The first standardized survey test was the Stanford
Achievement Test.¹⁸ Designed primarily for use at the elementary level, this
test has continued through various editions and revisions to the present day.
Not long after its publication a similar survey test for high schools made
its appearance.¹⁹ These were followed in a few years by a number of others.


11 Arthur S. Otis, Otis Group Intelligence Scale (Yonkers-on-Hudson, N.Y.: World Book
Company, 1918).

12 W. S. Miller, Miller Mental Abilities Test (Yonkers-on-Hudson, N.Y.: World Book
Company, 1921).

13 Lewis M. Terman, Terman Group Test of Mental Ability (Yonkers-on-Hudson, N.Y.:
World Book Company, 1920).

14 Daniel Starch, Educational Measurement (New York: The Macmillan Company,
1916), 202 pp.

15 W. S. Monroe, J. C. DeVoss, and F. J. Kelly, Educational Tests and Measurements
(Boston: Houghton Mifflin Company, 1917), 309 pp.

16 Harold O. Rugg, Statistical Methods Applied to Education (Boston: Houghton
Mifflin Company, 1917), 410 pp.

17 The Measurement of Educational Products, Seventeenth Yearbook of the National
Society for the Study of Education, Part II (Bloomington, Ill.: Public School Publishing
Company, 1918), 192 pp.

18 Truman L. Kelley, Giles M. Ruch, and Lewis M. Terman, Stanford Achievement Test
(Yonkers-on-Hudson, N.Y.: World Book Company, 1923).

19 G. M. Ruch, Iowa High School Content Examination (Iowa City, Iowa: Bureau of
Educational Research and Service, University of Iowa, 1925).
Personality Tests


Some significant beginnings in the measurement of personality were 
made during the period from World War I to 1930. Woodworth’s Personal
Data Sheet,²⁰ previously mentioned, consisted of a list of questions based on
common neurotic symptoms, and was useful in identifying maladjusted
men. This was a prototype of many of the so-called personality inventories.

One of the best known early measures of personality was the Rorschach 
Ink Blot Test, which was first published in Bern, Switzerland, in 1921, under 
the title of Psychodiagnostics: A Diagnostic Test Based on Perception. The
Rorschach is still widely used today. Also among the early personality tests
was the Downey Will Temperament Test.²¹ In this, Downey attempted to explore
certain aspects of personality such as flexibility, finality of judgment,
and interest in detail, by having the subject write or copy materials under
different conditions. For example, the subject was asked to write as fast as
he could, disguise his handwriting, or write with his eyes closed. Another
early test, by Pressey,²² was designed to measure emotional tone or feeling.

A number of tests of character or ethical discrimination also appeared 
during this period, such as the tests of trustworthiness by Voelker.²³ These
tests were designed to measure the effect of special training and instruction 
in trustworthiness through testing the tendency toward overstatement, the 
resistance to opportunities for cheating, and so forth. 


Aptitude Tests 


During the period under review some tests of aptitude also made their
appearance. Some of the earliest work was done in musical aptitude by
Seashore,²⁴ in mechanical aptitude by Stenquist,²⁵ and in the clerical field by
Thurstone.²⁶

During the 1920's a number of aptitude tests in the fields just mentioned, 
and in other fields, appeared in various forms. Without going into detail 
about such tests at this point, it may be said that generally they consist of 


20 R. S. Woodworth, Personal Data Sheet (Chicago: C. H. Stoelting Company, 1918).

21 June E. Downey, The Will Temperament and Its Testing (Yonkers-on-Hudson, N.Y.:
World Book Company, 1923).

22 S. L. Pressey, “A Group Scale for Investigating the Emotions,” Journal of Abnormal
Psychology, 16:55-65 (1921).

23 Paul F. Voelker, The Function of Ideals and Attitudes in Social Education, Contributions
to Education, No. 112 (New York: Teachers College, Columbia University, 1921).

24 C. E. Seashore, The Psychology of Musical Talent (New York: Silver, Burdett and
Company, 1919), 288 pp.

25 J. L. Stenquist, Stenquist Mechanical Aptitude Tests (Yonkers-on-Hudson, N.Y.:
World Book Company, 1922).

26 L. L. Thurstone, “Standardized Tests for Office Clerks,” Journal of Applied Psychology,
3:248-51 (1919).
exercises of a type familiar respectively to the musician, the artist, the
mechanic, the clerk. For example, the Seashore Test consists of exercises 
dealing with such elements as pitch discrimination, rhythm, duration and 
intensity or loudness of tones. It may be said that by 1930 the funda- 
mentals of the approach to the measurement of aptitudes in specific areas 
or fields had become fairly well established. 

A number of other factors contributed to the development of educational 
measurement during this period. Among those which might be mentioned 
here is the educational survey. Beginning in the 1920's, some of the larger 
cities and a few states employed teams of experts to make thorough surveys 
and appraisals of the school systems. While these surveys were largely 
concerned with school plant, administrative organization, qualifications of 
staff, and related matters, many of the surveys made use of tests of intelli- 
gence and achievement to measure efficiency of curriculum and methods, 
amount of retardation, and comparative achievement. This use of tests 
encouraged the development of new instruments and extended the use of 
existing ones. 

Statewide testing programs also stimulated the development and use of 
the objective, standardized test. A number of states began, and have con- 
tinued to the present, testing programs involving most, if not all, of the 
pupils enrolled in elementary grades. The New York Regents examina- 
tions, which were in use long before modern tests were available, may also 
have affected objective test development, though perhaps in a negative 
way. For many years these examinations were of the essay type and, be- 
cause of the large number of pupils involved, posed a difficult scoring prob- 
lem. 

Other developments that originated during this period and which un- 
doubtedly contributed to the growth of the modern measurement move- 
ment were the establishment of educational journals devoted at least in 
part to articles dealing with measurement, the organization of professional
societies with membership including many persons interested and working 
in the field of measurement, and the development of bureaus of educational 
research in larger universities, city school systems, and state departments 
of education. These bureaus, especially the ones in higher institutions, 
contributed much to the development of new instruments and techniques 
of measurement through research carried on by staff members and graduate 
students. 

In a sense, this period represents some of the best and some of the worst 
in the era of modern measurement. It was characterized by great activity, 
both in the development of measuring instruments, and in their widespread 
application. In the use of the new instruments educators tended to be 
impressed with the values and advantages of the instruments, and were 
less aware than we are today of the inherent limitations. Probably the
same thing may be said of any new ideas or inventions; the early stages 
are nearly always characterized by great enthusiasm and a lack of aware- 
ness of shortcomings. The latter usually become evident with time and 
extensive use. If the new ideas have real merit, they will be constantly
refined and improved through trial, experimentation, and critical study. 
6. By 1930 the types of educational and mental tests known and used today were
quite well established. Name and briefly describe these early models. 

7. What factors during the 1920's contributed substantially to the development 
of, and growth of interest in, educational measurement? 

8. Name some factors or occurrences during this period, which, while related to 
the area of measurement, failed to advance or improve measurement techniques. 


FROM 1930 TO THE PRESENT 


The 1930’s may be regarded as a period of questioning and doubt. As 
more and more objective tests, both teacher-made and standardized, were 
used, some unfavorable reactions were inevitable. Many test specialists 
and users began to raise questions about the new type tests, particularly 
regarding the limitations of scope. There was a feeling that while the tests 
were generally good, they failed to measure some of the most fundamental 
educational objectives. Some tests were criticized for being too specific
and for not testing the pupil's ability to organize his knowledge and present 
an acceptable written statement of it. 

Although objective standardized tests continued to be published in great 
quantities, and one of the largest organizations devoted to such activities, 
the Cooperative Test Service, was established about 1930, the movement
to develop other kinds of measures made steady progress. A number of 
leaders in this field, together with such organizations as the Progressive 
Education Association, became strongly identified with the evaluation move- 
ment. The new work in evaluation emphasized the importance of measuring 
more than knowledge and skills. More attention was focused on measure- 
ment of such outcomes of instruction as attitudes, interests, appreciations, 
and ability to use the scientific method. Furthermore, the newer techniques 
emphasized the importance of supplementing standardized tests with 
locally-made tests in order to measure fairly those areas emphasized by the 
individual teacher, areas which the standardized test might treat inad- 
equately or not at all. 

It is likely that Gestalt psychology, which emphasizes the inter-relatedness
of the parts of a whole, also had a marked influence on teaching and
measurement during this period. Although few teachers are likely to have 
a thorough knowledge of the wide implications of the Gestalt school of 
psychology, most have assimilated something of the basic principles of the 
Gestalt school and have formed therefrom a concept of “the whole child,” 
which has become almost a cliché. 

Nevertheless, this concept has had a generally beneficial effect on educa- 
tional measurement. It has served to remind all who deal with human be- 
ings that a person is an individual unlike any other, and that there are many 
different facets and aspects of his personality, all of which combine to make 
him the person he is. In order to help an individual we must understand 
him, and in order to understand him we must know as much about him as 
possible — about his knowledge, interests, health, family, ability, and ex- 
perience. It is essential, moreover, to try to understand the relationship 
of these various aspects to each other. Thus, it can be argued that in order 
to know as much as possible about a person, more, not fewer, tests and 
measurements should be used. The more aspects of an individual we are 
able to measure, the more complete and rounded will be our knowledge and 
understanding of him, and, therefore, the better we should be able to help 
him. 


Measurement and World War II

As in World War I, the use of measurement was greatly stimulated by 
World War II. Though no contributions comparable in originality and 
uniqueness to Army Alpha and Army Beta were produced as a result of the
Second World War, testing became widespread in all the armed forces, both 
in amount and variety. A great deal of work was done in devising effective 
aptitude tests for the placement of personnel in such specialties as radio, 
navigation, and radar. The procedures set up for appraisal and assignment 
of men and women were far more systematic than those which had been 
developed in the First World War. A large amount of research on the 
nature of human abilities was conducted by the armed services and much 
experimental work in the development of tests, rating scales, and other 
assessment techniques was carried on. 

The war also stimulated interest in the use of clinical instruments, espe- 
cially projective tests like the Rorschach. Following the war, the federal 
government approved and supported the establishment of programs for 
the training of clinical psychologists in many universities. 


The Guidance Movement 
As early as 1920 and even before, some tests were devised for use in edu- 
cational and vocational guidance. Some of these, such as the aptitude tests, 
have been mentioned. The establishment of federal-aided vocational programs
in agriculture, trade, industry, and homemaking stimulated the
guidance movement, as did the expansion of the high school program. As 
more and more adolescents entered secondary school, and as the curriculum 
became more varied, the need for more systematic and effective counseling 
increased. 

Probably the greatest impetus to guidance and counseling came as a re- 
sult of World War II, however. With millions of men and women being in- 
ducted into military service, the task of classifying and assigning each to 
the kind of duty which he or she could perform efficiently with a minimum 
of training became of paramount importance. It was imperative that rapid 
and reasonably accurate classifying methods be devised. As a result of this 
need, thousands of men and women became classification officers or special- 
ists. Considering the number of cases they had to deal with, the great vari- 
ety of specialties which had to be filled, and the inevitable lack of knowledge 
and precedent, these specialists did a remarkable job. The wonder is that 
they made so few mistakes. 

All this — the changes in the secondary school population and curriculum,
the development of tests of aptitude, and the demands of the war — gave to 
guidance practices a momentum and growth which have continued to the 
present time. Such growth is reflected in the vastly increased use of and 
need for tests and measuring instruments of all kinds in counseling, as well 
as in other areas. 
9. What were some of the main criticisms of tests and testing which developed in
the 1930's and 1940's? Do they apply today?
10. Compare the contributions to the measurement field of World War I and
World War II.
11. The guidance movement of today is a fairly recent development. Did the 
improvement of measurement techniques influence this? If so, in what ways? 


THE PRESENT AND THE FUTURE 


Since 1945 the measurement field has continued to develop and expand, 
though no really significant innovations have been introduced. The reaction 
against uncritical use of objective tests in the schools has had a good effect, 
yet certainly it does not appear that fewer standardized tests are being used; 
rather, those now appearing generally show some attempt to measure more 
than knowledge as an outcome of schooling. 

In the area of intelligence and aptitude testing one relatively new approach
is worthy of particular mention. This is called factor analysis, a
statistical method of refining tests — that is, reducing them to a few basic 
elements or factors. Thus, a well-known intelligence test, the Primary 
Mental Abilities Test,²⁷ purports to measure general mental ability as expressed
through a small number of group factors designated as space, number,
verbal, word fluency, and reasoning. These factors are the ones which the 
authors of the test find adequate for measuring almost the whole area of 
general mental ability or intellect. Other experimenters have suggested 
different numbers and designations of factors; indeed, the theory goes back 
at least as far as 1904 to what appears to be the first published work on 
factors, that by Charles Spearman,²⁸ an English statistician. In his interesting
paper Spearman reviews previous work on the measurement of intelligence
and advances for the first time his theory of a factor of general
intelligence and some specific factors relating to particular or special abilities. 
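
Spearman's idea of one general factor plus specific factors may be sketched in symbols. The notation below is a modern convention and an assumption of this sketch, not Spearman's own: x_{ij} stands for the standardized score of person i on test j, g_i for that person's standing on the general factor, a_j for the weight (loading) of test j on the general factor, and s_{ij} for the part of the score specific to that test.

```latex
% Two-factor sketch: each score = general part + specific part
x_{ij} = a_j \, g_i + s_{ij}

% If the specific parts are uncorrelated with g and with one another,
% and g has unit variance, the correlation between two different
% tests j and k depends only on their general-factor loadings:
r_{jk} = a_j \, a_k \qquad (j \neq k)
```

Under these assumptions, two tests correlate with each other only to the extent that both draw on the general factor, which is the pattern a factor analysis looks for in a table of correlations.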

With aptitude tests or batteries these group factors are related to par- 
ticular vocations or fields, such as mechanical or clerical. For example, a 
person scoring high on space, reasoning, and number might thus show a pat- 
tern of aptitudes that would be indicative of potential for success in me- 
chanical pursuits; one making high scores on verbal, word fluency, and 
reasoning, for success in persuasive endeavors such as selling. These lines
of research have important implications for the future of measurement. 

In general, the present decade may be characterized as a time of adjust- 
ment and refinement in educational measurement. Techniques or methods 
comparable in originality or significance to those of the first quarter of this 
century do not seem to be emerging. Instead, refinement of existing tests 
and development of new ones along established patterns are going on con- 
stantly. It is possible that the measurement movement is on the verge of 
some new and original developments. In 1900, physicists felt generally 
that they had few frontiers left to explore. Physics was considered virtually 
a closed science! Today, however, as a result of recent work on the nature
of matter, physicists have a whole new world in which to pioneer. Whether 
a similar new era will eventually open in the field of educational measure- 
ment remains to be seen. Certainly there are many unsolved problems, a 
few of which have been touched upon in this brief historical chapter. 
12. What changes of emphasis in objectives have been taking place in educational
measurement during the last twenty-five years? Can you cite evidence to
support your answer?

27 L. L. Thurstone and Thelma Gwinn Thurstone, Chicago Tests of Primary Mental
Abilities (Chicago: Science Research Associates, 1941-1947).

28 Charles Spearman, “‘General Intelligence’ Objectively Determined and Measured,”
American Journal of Psychology, 15:201-93 (April, 1904).
13. What is factor analysis? Can it be applied to the development of tests other
than intelligence tests? Explain. (Perhaps you can find an illustration in educa- 
tional literature.) 

14. List a few of the “unsolved problems” in educational measurement. Which
do you consider to be the most important? 
A Little Statistics


INTRODUCTION 


While most of us probably make no claim to extensive statistical knowledge 
or training, we all have at least an elementary understanding of statistics. 
For example, most men are familiar with batting and fielding averages as a 
measure of the skill of the professional baseball player; most housewives 
are able to read and understand reports in which price ranges and average 
prices of various articles are discussed; most of us understand weather 
bureau reports on the average rainfall, temperature, and barometric pres- 
sure. In addition, vital statistics such as birth and death rates have some 
meaning for all of us. 

As students and teachers or prospective teachers we know well the sig- 
nificance and importance of the class average in the assignment of marks. 
Few of us are totally unfamiliar with the meaning of such terms as rank in 
class, percentiles, medians, quartiles, and correlation. So, while we may not be 
statisticians in any formal or learned sense, we are all users of statistics in 
our work and in our recreational activities. 

Any teacher or counselor who uses tests or measuring instruments in his 
work soon learns that some systematic treatment of scores is necessary to 
assure maximum benefit to all pupils and teachers concerned. The degree 
to which the results of measurement are useful is in part proportional to the 
accuracy and thoroughness with which those results are analyzed. This 
does not mean that every set of test scores must be subjected to exhaustive 
analysis, but it does mean that test scores in themselves have no significance 
except when some statistical analysis of them has been made. For example, 
to say that John’s score on an arithmetic test is 36 does not tell us anything 
about his achievement in arithmetic. Before we can interpret such a num- 
ber we need more information. Taken by itself it is merely a number. 
Likewise, if we wish to use the results of a test to make comparisons between 
classes or groups or to do a better job of teaching or counseling it is necessary
to make some analysis of the test scores. Such analysis can be performed
only by statistical methods, however elementary these may be. 
Statistical methods are simply tools to help us acquire understanding of pu- 
pils from numbers which, in themselves, are of little or no value to the 
teacher. 

As most of us are already more or less familiar with some common sta- 
tistical terms, it is the purpose of this chapter to round out our understand- 
ing of basic statistical ideas and techniques, and to show how statistical 
analysis of a very elementary nature can be applied by any teacher or coun- 
selor to scores on tests so that the scores can be made meaningful and there- 
fore useful. 

For those who are interested, a more extensive treatment of the techniques 
of simple statistical analysis, together with examples and guides, is given 
in Appendix A. 


Learning Exercises 


1. What uses does the average consumer make of statistics? The government? 
The classroom teacher? 

2. Make a list of statistical terms or concepts that you know. How many can 
you define? 


INTERPRETATION OF TEST SCORES: SIMPLER METHODS 


Simple Ranking 

There are a good many ways to approach the interpretation of a test 
score. Let us consider one of the more elementary ones, using John's score 
of 36 on arithmetic as an example. It is apparent that we need more infor- 
mation to make any interpretation of the score of 36. One rather easy and 
quite useful method is that of ranking. If we know the scores of the other 
members of the class we can place John according to his relative position. 
This may be done by arranging the scores in order, from highest to lowest, 
and assigning each score a number according to its position. Usually we 
give the highest score a rank of one, the second highest a rank of two, etc. 

In order to find John's rank in arithmetic we must know all the scores in 
his group or class. Let us suppose they are as follows: 


ARITHMETIC TEST SCORES 


Pupil     A    B    C    D    E    F    G    H    I    J    K    L
Score    44   21   14   18   46   45   52   30   39   36   31   22

Pupil     M    N    O    P    Q    R    S    T    U    V    W    X    Y
Score    23   38   33   33   29   38   32   29   42   28   26   33   25

John’s score is 36, but we cannot tell anything about his rank until the 
scores are rearranged in order from highest to lowest. When this has been 
done the scores look like this: 


Pupil     G    E    F    A    U    I    N    R    J    O    P    X
Score    52   46   45   44   42   39   38   38   36   33   33   33

Pupil     S    K    H    Q    T    V    W    Y    M    L    B    D    C
Score    32   31   30   29   29   28   26   25   23   22   21   18   14


When the scores are ranked we have the following array: 


Pupil               G     E     F     A     U     I     N     R     J     O     P     X
Score              52    46    45    44    42    39    38    38    36    33    33    33
Rank in Class       1     2     3     4     5     6   7.5   7.5     9    11    11    11

Pupil               S     K     H     Q     T     V     W     Y     M     L     B     D     C
Score              32    31    30    29    29    28    26    25    23    22    21    18    14
Rank in Class      13    14    15  16.5  16.5    18    19    20    21    22    23    24    25


Note that where two or more scores are alike, the places they would 
otherwise hold have been averaged and the average thus obtained has been 
given as a rank to those pupils with the same scores. Thus, N and R both 
have a score of 38. I, with a score of 39, ranks 6th and so N and R would 
occupy 7th and 8th place. By averaging these (7 and 8), each score receives
a rank of 7.5. Similarly with O, P, and X, and Q and T. 

John's score of 36 in arithmetic places him 9th in his class, or, to put it 
another way, there are 8 pupils whose scores on the test are better and 16 
whose scores are poorer than his. This gives definite meaning to his score, 
which it did not have before. 
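
The ranking procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are those of pupils A through Y as read from the class list above, John is taken to be pupil J, and the function name average_ranks is our own invention rather than anything given in the text.

```python
# Arithmetic scores for pupils A through Y, as read from the class list above.
PUPILS = "ABCDEFGHIJKLMNOPQRSTUVWXY"
SCORES = [44, 21, 14, 18, 46, 45, 52, 30, 39, 36, 31, 22,
          23, 38, 33, 33, 29, 38, 32, 29, 42, 28, 26, 33, 25]


def average_ranks(scores_by_pupil):
    """Rank from highest to lowest; tied scores share the average of their places."""
    ordered = sorted(scores_by_pupil.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    start = 0
    while start < len(ordered):
        end = start
        # Extend the block while the score stays the same (a tie).
        while end < len(ordered) and ordered[end][1] == ordered[start][1]:
            end += 1
        # The tied block occupies places start+1 through end (counting from 1).
        shared_rank = (start + 1 + end) / 2
        for pupil, _ in ordered[start:end]:
            ranks[pupil] = shared_rank
        start = end
    return ranks


ranks = average_ranks(dict(zip(PUPILS, SCORES)))
print("John (J):", ranks["J"])
print("N and R:", ranks["N"], ranks["R"])
```

Pupils with tied scores receive the same averaged rank, and every other pupil keeps the place he holds in the ordered list.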


Percentile Rank 

The method of ranking just described has certain disadvantages which 
sometimes are rather troublesome. The most important of these arises 
from the fact that the procedure takes no account of differences in the size 
of groups. For example, a rank of 15 in a group of 15 means quite a differ- 
ent thing than a rank of 15 in a group of 100. In the first case the rank is the 
lowest in the group, whereas in the second it is one of the best. As long as the 
comparisons of individual ranks are based on groups of the same size this is 
no problem. Where the groups are quite different in size some other pro- 
cedure must be employed if a person’s standing in one group is to be com- 
pared with his standing in another one. One method which eliminates this 
difficulty is that of percentile ranks. Percentile ranks differ from simple 
ranks in that they express the position of any score in the group in terms of 
the percentage of the group below that score. In our earlier example we 
said John’s score of 36 gave him a rank of 9 in the class of 25. We can also 
say that his score of 36 or rank of 9 places him above 16 others. Putting 
this in another way we can say that 16/25 or 64 per cent of the class makes a 
lower score than John on the arithmetic test. This is his percentile rank. 
By use of percentages, differences between sizes of classes or groups are 
equated. 

Now John might take a history test with his whole grade of several 
hundred pupils. If his rank on this test were 44 it would be meaningless 
to compare this with his rank of 9 in arithmetic. Even if we knew that 
John’s rank of 44 on the history test were based on a group of 200, it would 
still be difficult to compare it with his rank of 9 in a group of 25. However, 
we know his percentile rank in arithmetic to be 64. In order to calculate 
his percentile rank on the history test we find first the number who make 
lower scores than John: 200 - 44 = 156. Then 156/200 equals .78, giving John a
percentile rank of 78. In other words, 78 per cent of the class made lower 
scores than he did on the history test. Thus we can compare his standing 
on the two tests directly and say that his achievement on the first test (arith- 
metic) is apparently not as good as his achievement on the history test. 
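
The two percentile-rank calculations may be checked with a short Python sketch. It assumes, as the examples in the text do, that the number of pupils below a given pupil is simply the group size minus his rank; where scores are tied this is only an approximation. The function name percentile_rank is our own.

```python
def percentile_rank(rank, group_size):
    """Per cent of the group scoring below a pupil who holds the given rank."""
    number_below = group_size - rank
    return 100 * number_below / group_size


arithmetic = percentile_rank(rank=9, group_size=25)    # John in his class of 25
history = percentile_rank(rank=44, group_size=200)     # John in a grade of 200
print(arithmetic, history)
```

Because both results are percentages, they can be set side by side even though the groups differ greatly in size.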


Central Tendency or Averages 

We may go beyond ranking and ask about the average of the class. Know- 
ing this will assist us in further interpreting John’s score. Determining the 
class average introduces another concept, that of averages, or central tendency. 
An average is a number, not always an actual score, which is taken as the 
most likely or typical value for a group of numbers or scores. There are 
several kinds of averages, but those most often used in educational work 
are the arithmetic mean and the median. 

The arithmetic mean is obtained by adding all the scores and dividing by 
the number of scores. Thus, in the case of John's class in arithmetic we 
would add the scores of all the pupils and divide by 25, the number of 
pupils. (See Appendix A for the calculations of the mean by two different 
methods.) In baseball the player's batting average is obtained by dividing 
the number of times he hits safely by his total number of official times at 
bat. Therefore, if he got a hit every time his average would be unity or one. 
Actually, his average is always a decimal less than one, since no regular 
player has ever hit safely every time at bat for a whole season. 

The other common measure of central tendency is the median. This is 
most simply the middle score or the one in a series which has an equal 
number of scores above it or higher than it is, and an equal number below it 
or lower than it is. Thus, to illustrate, if the heights of five pupils are 
respectively 60, 55, 52, 50, and 48 inches, the mid-score or median is 52 since 
two children are taller than that and two are shorter. If the number of 
cases is even, the median is midway between the lowest score of the upper 
half of the group and the highest score of the lower half. In our example, 
if we had six children instead of five, with heights of 60, 58, 55, 52, 50, and 
48 inches respectively, we would say the median was midway between 55 
and 52, or 53.5. 


1 The word average is often used synonymously with arithmetic mean, but this usage is 
not precise since all measures of central tendency are averages. 

To return to John and his arithmetic, suppose we have added the scores 
of all 25 pupils and divided by 25 and found the arithmetic mean to be 32.3. 
John’s score of 36 would place him definitely above the mean, or, as we might 
say, he is above average on this test. If we arrange the 25 scores in order 
from highest to lowest we find the median or middle score (the 13th one in 
this case) to be 32, almost the same as the arithmetic mean. (See Appen- 
dix A for a more exact method of calculating the median.) Another way of 
defining the median is to say that it is the 50th percentile or that it is the 
score or point that has a percentile rank of fifty. 
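
As a rough check on these figures, the mean and median can be computed with Python's standard statistics module. The list of arithmetic scores is the class list given earlier, as we read it; the variable names are ours.

```python
import statistics

# The 25 arithmetic scores of John's class, pupils A through Y.
arithmetic_scores = [44, 21, 14, 18, 46, 45, 52, 30, 39, 36, 31, 22,
                     23, 38, 33, 33, 29, 38, 32, 29, 42, 28, 26, 33, 25]

class_mean = statistics.mean(arithmetic_scores)      # sum of scores / 25
class_median = statistics.median(arithmetic_scores)  # the 13th score in order

# The height examples: an odd and an even number of cases.
median_of_five = statistics.median([60, 55, 52, 50, 48])
median_of_six = statistics.median([60, 58, 55, 52, 50, 48])

print(round(class_mean, 1), class_median, median_of_five, median_of_six)
```

With an even number of cases the function returns the point midway between the two middle values, which is the rule stated above.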


Comparison of Mean and Median 


With larger numbers of scores or cases, the median is a little simpler to 
calculate than the mean. The chief difference between them can be shown 
by the following example. Suppose one wishes to find the average salary of 
a group of executives whose annual salaries were $50,000, $20,000, $15,000, 
$13,000, and $12,000. The arithmetic mean would be calculated thus: 


     $50,000 
      20,000 
      15,000 
      13,000 
      12,000 
    -------- 
    $110,000 ÷ 5 = $22,000 

The median would be the mid-score or mid-salary, namely $15,000. Which 
of these would be the more representative or more typical average? It 
requires no technical knowledge to answer this question since it is obvious 
that the median — $15,000 —is much more representative of four of the 
five salaries than the arithmetic mean which is $22,000. There are situations
in which the two measures give the same or nearly the same value. 
The arithmetic test scores presented earlier are a case of this kind. Here it 
really makes little difference which measure is used. 

In certain situations where it is important from a statistical standpoint 
to give every score or measure its full weight according to its magnitude, the 
arithmetic mean is the average to use, but in nearly all ordinary situations 
encountered in school work the median serves equally well, and it has the 
advantage of being simpler to calculate and understand. 
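
The salary example shows how a single extreme value pulls the mean away from the median. A minimal Python sketch, using only the five salaries quoted above and variable names of our own choosing, is:

```python
import statistics

salaries = [50_000, 20_000, 15_000, 13_000, 12_000]

mean_salary = statistics.mean(salaries)      # every salary counts by its size
median_salary = statistics.median(salaries)  # only the middle position counts

# How many of the five salaries lie below each kind of average?
below_mean = sum(1 for s in salaries if s < mean_salary)
below_median = sum(1 for s in salaries if s < median_salary)

print(mean_salary, median_salary, below_mean, below_median)
```

Four of the five executives earn less than the mean, which is one way of seeing why the median is the more typical figure here.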

To return to the question of the meaning of a test score, and to sum up 
what has been said so far, a score has no meaning except as it can be com- 
pared to other scores in the same series, or to the central tendency or aver- 
age of all scores in the series. 


Learning Exercises 


3. The following scores were made on a test of word meaning by 24 ninth-grade 
pupils: 45, 50, 41, 39, 45, 33, 42, 44, 38, 44, 25, 24, 44, 50, 32, 42, 40, 49, 60, 29, 50, 
37, 47, 55. Rank them. 

4. Find the percentile ranks of scores 33, 42, and 60. Can a score have a percentile 
rank of 100? Justify your answer. 

5. Find the mean and the median of the 24 scores. Are they the same? If not, 
why not? 


INTERPRETATION OF TEST SCORES: MORE REFINED METHODS 


Inadequacy of Central Tendency or Averages 

Although comparisons of one score with others in the same group or 
series are helpful and give meaning to the score, such comparisons are not 
always sufficient for all purposes. It may be desirable to know, for example, 
how John's score on arithmetic compares with his score on a reading test. 
Or again, it may be very useful to a teacher to know how one group or class 
compares with another, in I.Q., for example, or in such traits as height and 
weight, or in two different school subjects. In order to make such compari- 
sons we need more than ranks or averages. Let us suppose that a teacher 
has two classes in general science and has given both of them a test. He 
has calculated the mid-scores of Class I and of Class II and they are exactly 
the same, namely 47. To make the illustration very simple, let us assume 
that there are just nine pupils in each class and that the scores on the test 
in the two classes are as shown in Table I. 


Table I 


Comparison of Groups Having Same Average but Differing 
in Spread or Range of Scores 


          Class I                              Class II
Pupil   Score   (in order)            Pupil   Score   (in order)

  A      51     (B) 70                  L      40     (M) 61
  B      70     (I) 64                  M      61     (N) 55
  C      32     (F) 59                  N      55     (R) 53
  D      44     (A) 51                  O      38     (P) 50
  E      47     (E) 47 (mid-score)      P      50     (Q) 47 (mid-score)
  F      59     (D) 44                  Q      47     (L) 40
  G      28     (C) 32                  R      53     (S) 39
  H      19     (G) 28                  S      39     (O) 38
  I      64     (H) 19                  T      31     (T) 31

Sum     414                           Sum     414

Mean equals 414 ÷ 9, or 46.           Mean equals 414 ÷ 9, or 46.
Range equals 70 - 19 equals 51.       Range equals 61 - 31 equals 30.


Inspection of these scores shows that the mid-score of the two groups is 
47, as has been stated; the arithmetic mean of the two is also the same, 
namely 46. A teacher or counselor who made no further breakdown of these 
scores might easily conclude that the two classes are comparable in all respects
as measured by the test. However, a closer study shows a marked 
difference between the groups, not in central tendency but in spread or variability.
No informed teacher would handle two such classes in the same 
way! Whereas Class I has a score range of 51 points, from a low of 19 to a 
high of 70, Class II has a range of only 30 points, from 31 to 61. In other 
words, we can say that on this test, Class II is more homogeneous than Class 
I, or that Class I is more heterogeneous than Class II. This difference would 
have considerable bearing on the methods used in handling the two classes. 
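
The comparison in Table I can be reproduced with a short Python sketch. The two score lists are those of Table I as we read them; the helper name summarize is our own and is offered only as an illustration.

```python
import statistics

class_1 = [51, 70, 32, 44, 47, 59, 28, 19, 64]   # pupils A through I
class_2 = [40, 61, 55, 38, 50, 47, 53, 39, 31]   # pupils L through T


def summarize(scores):
    """Return the two averages and the crude range for one class."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "range": max(scores) - min(scores),
    }


summary_1 = summarize(class_1)
summary_2 = summarize(class_2)
print(summary_1)
print(summary_2)
```

The averages agree exactly, and only the range reveals that the two classes differ.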


Common Measures of Variability 


The range is a very crude or rough measure of variability since it is based 
on only two measures, the highest and lowest scores. It is used here for the 
sake of simplicity in bringing out the main point, that is, the importance of 
the concept of variability in interpreting test results. In comparing groups 
or classes it is sometimes more important to know something about their 
respective spreads or variabilities than to know what their averages are. 
This is a fact not always appreciated as it should be. Generally speaking, 
knowledge of both types of measures is necessary for adequate comparisons 
of test scores. 

There are several other measures of variability, two of which every teacher 
should know about. These are the semi-interquartile range and the standard 
deviation. The former is a measure of the spread of the middle half of the 
scores in any distribution. If we cut off the highest 25 per cent and the 
lowest 25 per cent of the scores in a class, the middle 50 per cent remains. 
Half the difference between the point which cuts off 75 per cent of the scores 
from the bottom of the distribution and the point which cuts off 25 per cent 
from the bottom is the semi-interquartile range, designated by the capital 
letter Q. There is a formula for this, $Q = \frac{Q_3 - Q_1}{2}$, where Q = semi-interquartile
range, $Q_3$ = 75th percentile, and $Q_1$ = 25th percentile. If we 
compare the Q's of two groups or classes on the same test and find Q to be 
larger in one group than the other, it tells us that one group is more varied, 
diverse, or heterogeneous than the other. The semi-interquartile range is a 
refinement of the crude range. It eliminates the highest and the lowest 25 
per cent of the scores, which are likely to be more scattered and unreliable, 
and expresses variation in terms of the more stable and concentrated middle 
50 per cent of the scores. Whereas the range would be strongly affected by 
one score at the top or bottom of the distribution, the Q would not be af- 
fected at all by a single extreme score. 
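
A hedged Python sketch of the semi-interquartile range follows. Several conventions exist for locating the 25th and 75th percentiles in a small set of scores; the "inclusive" method of the standard library's statistics.quantiles is used here purely as an assumption, and the method of Appendix A may give slightly different values. The scores are those of Table I, and the altered list with a top score of 100 is hypothetical.

```python
import statistics


def semi_interquartile_range(scores):
    """Q = (Q3 - Q1) / 2, with quartiles found by the 'inclusive' convention."""
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return (q3 - q1) / 2


class_1 = [51, 70, 32, 44, 47, 59, 28, 19, 64]
class_2 = [40, 61, 55, 38, 50, 47, 53, 39, 31]

q_class_1 = semi_interquartile_range(class_1)
q_class_2 = semi_interquartile_range(class_2)

# Hypothetical change: the best pupil in Class I scores 100 instead of 70.
class_1_extreme = [100 if score == 70 else score for score in class_1]
q_extreme = semi_interquartile_range(class_1_extreme)
range_extreme = max(class_1_extreme) - min(class_1_extreme)

print(q_class_1, q_class_2, q_extreme, range_extreme)
```

Under this convention Class I again appears the more variable, and the single extreme score widens the range while leaving Q unchanged.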

It is often helpful in thinking about averages and measures of variability 
to remember that an average is always a point or score, whereas a measure 
of variability is always a distance. For example, in the nine scores made by 
Class I and shown in Table I, the mean score is 46 — a single score or point 
on the scale; the range is 51 score points, a distance between the highest 
score of 70 and the lowest score of 19. Similarly, in Class II, the mean is 
46, while the range is the difference between the highest score and the low- 
est score, in this case, 61 — 31 = 30. 

From a statistical standpoint, the best all-around measure of variability 
is the standard deviation, usually designated by a sigma (σ). The standard 
deviation is a true measure of variability in that it is based on the deviations 
of scores from a measure of central tendency, usually the arithmetic mean, 
and it is the most reliable one because it takes into account the actual vari- 
ation of each score from the mean of the series. 

The standard deviation may be defined as that distance which, laid off 
above and below the mean of a normal distribution, will include 68.26 per 
cent of the scores or cases. In a distribution which diverges strongly from 
the symmetrical, bell-shaped form, this percentage will vary somewhat 
depending on the degree of “skewness” or asymmetry of the curve. It is 
sufficiently accurate for most purposes to say that one σ or standard 
deviation, laid off on both sides of the mean, includes approximately 
two-thirds of the scores or area under the normal curve. This provides a 
measure of the spread of scores, or conversely, the tendency of the scores to 
cluster about the mean. For example, if we give a test to three classes the 
results might be as follows: 

            Class I    Class II    Class III
Mean           76         71          80
σ               8         12          16


Where do the middle two-thirds of the scores in each class fall? In Class I, 
the scores of the middle two-thirds of the pupils fall within the limits 76 ± 8, 
or between 68 and 84; in Class II, they fall within 71 ± 12, or between 59 
and 83; and in Class III, they fall within 80 ± 16, or between 64 and 96. 
Which is the least homogeneous class? Which do you judge is most homo- 
geneous? The mean and standard deviation give the basis for reliable 
answers to such questions and many others of interest to educational work- 
ers, psychologists, and statisticians. 
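
The limits for the three classes amount to nothing more than the mean minus and plus one standard deviation, as this small Python sketch shows. The figures are those of the example above; the dictionary layout is our own.

```python
# (mean, standard deviation) for each class in the example
classes = {"Class I": (76, 8), "Class II": (71, 12), "Class III": (80, 16)}

# Limits within which roughly the middle two-thirds of the scores fall
limits = {name: (mean - sigma, mean + sigma)
          for name, (mean, sigma) in classes.items()}

# The widest interval marks the least homogeneous class.
widths = {name: high - low for name, (low, high) in limits.items()}
least_homogeneous = max(widths, key=widths.get)

print(limits)
print(least_homogeneous)
```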

The details of calculating the standard deviation are shown in Appendix 
A. Going back to John's arithmetic class, when we determine 
the standard deviation for the 25 scores, we find that σ = 9.2. 

The student will recall that the mean of the arithmetic scores in John’s 
class was 32.3. If we now subtract one standard deviation from the mean 
we get 23.1. Likewise, by adding one standard deviation to the mean we get 
41.5. Between these two points, each a distance of one standard deviation 
from the mean, we should find approximately 68.26 per cent of our 25 
arithmetic scores. We find that the score nearest to 41.5 is 42, and the one 
nearest to 23.1 is 23. Beginning with the score of 23 in the ordered list, and 
counting each score up to and including 42, we find 17 scores lying between 
the two points. What proportion of 25 is 17? 17/25 equals 68 per cent 
exactly. In this case the calculated standard deviation value of 9.2, when 
actually applied to the distribution by adding and subtracting it from the 
mean, gives us two points on the scale between which 68 per cent of the 
cases lie. Results will not always conform so closely to the theoretical since 
many distributions in practice are less symmetrical than the one used here. 
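
The count just described can be repeated in Python. The sketch assumes the 25 arithmetic scores as read from the class list and uses the ungrouped formula that divides by the number of scores (statistics.pstdev). That formula gives a value of about 9.15, slightly below the 9.2 quoted above; the difference presumably reflects rounding or the computing method of Appendix A, which is our assumption and not something the text states.

```python
import statistics

arithmetic_scores = [44, 21, 14, 18, 46, 45, 52, 30, 39, 36, 31, 22,
                     23, 38, 33, 33, 29, 38, 32, 29, 42, 28, 26, 33, 25]

mean = statistics.mean(arithmetic_scores)
sigma = statistics.pstdev(arithmetic_scores)   # divides by N, not N - 1

lower, upper = mean - sigma, mean + sigma

# As in the text, take the actual scores nearest to the two limits.
low_score = min(arithmetic_scores, key=lambda s: abs(s - lower))
high_score = min(arithmetic_scores, key=lambda s: abs(s - upper))

inside = [s for s in arithmetic_scores if low_score <= s <= high_score]
share = len(inside) / len(arithmetic_scores)

print(round(mean, 2), round(sigma, 2), low_score, high_score, len(inside), share)
```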

To illustrate one of the chief uses of the standard deviation in interpreting 
scores on tests, let us assume that the class of which John is a member has 
been given another test, this time in reading. As the teacher, we wish to 
know how the individual pupil's achievement in reading compares with his 
score on arithmetic. Since we have already done some work on the arith- 
metic scores, let us now consider the results of the reading test. We find 
that the scores of John and his classmates on the reading test are as follows: 


READING TEST SCORES 


Pupil     A    B    C    D    E    F    G    H    I    J    K    L    M
Score    86   68   48   70   94  102   92   72   91   80   69   62   56

Pupil     N    O    P    Q    R    S    T    U    V    W    X    Y
Score    73   66   77   75   65   87   46  111   81   59   77   76

When these are arranged in order and ranked, we get the following results: 


Pupil               U     F     E     G     I     S     A     V     J     P     X     Y     Q
Score             111   102    94    92    91    87    86    81    80    77    77    76    75
Rank in Class       1     2     3     4     5     6     7     8     9  10.5  10.5    12    13

Pupil               N     H     D     K     B     O     R     L     W     M     C     T
Score              73    72    70    69    68    66    65    62    59    56    48    46
Rank in Class      14    15    16    17    18    19    20    21    22    23    24    25


If we try to compare the scores of individuals on the two tests directly 
we find it rather confusing. For example, pupil A makes a score of 44 on 
arithmetic and 86 on reading. Does that mean he is almost twice as good 
in reading as in arithmetic? Pupil B scores 21 in arithmetic and 68 in 
reading. Is his reading three times as good as his arithmetic? Careful in- 
spection of the scores on the two tests shows that all pupils make higher 
scores in reading than they do in arithmetic. Does this mean that all pupils 
are better in reading than in arithmetic? Or does it show that the reading 
test is easier than the arithmetic? Actually, we cannot answer any of these 
questions without further information. A comparison of ranks shows that 
A has a rank of 4 in arithmetic and 7 in reading; B ranks 23rd in arith- 
metic and 18th in reading. We can see some tendency toward agreement in 
the ranks on the two tests, but not much more can be learned from them. 
A better method than simple ranks or percentile ranks for comparing such 
scores is now rather widely used. It is based on the respective means and 
standard deviations of the two sets of scores. 


Standard Scores 

Table II shows the means and standard deviations of the class, and 
John's scores on the two tests. The means are comparable since they repre- 
sent average achievement of the same pupils on the same examinations. 
Likewise, the standard deviations are comparable and scores can be com- 
pared in terms of these values. For example, a point one standard deviation 
below the mean on the arithmetic test would be 32.3 - 9.2, or 23.1. A similar
point on the reading test would be 75.3 - 15.4, or 59.9. These two 
points or scores represent the same levels of achievement on the two tests 
since they represent the same relative attainment on each test. Similar 
points can be worked out all along the scale, both below and above the 
mean, in standard deviation units. 


Table II 


Comparison of Results of Arithmetic and Reading Tests 


                        Arithmetic    Reading

Class Mean                 32.3         75.3
Standard Deviation          9.2         15.4
John’s Score                 36           80


Thus, in the case of John, his arithmetic score is 3.7 points above the 
mean (36 - 32.3 = 3.7). Dividing 3.7 by the standard deviation yields a 
value of .40 (3.7 ÷ 9.2 = .40). This tells us that on the arithmetic test 
John's score is .40 of a standard deviation above the mean of his class. On 
the reading test, John’s score is 4.7 points above the mean (80 - 75.3 = 4.7). 
Since the standard deviation in this case is 15.4, we divide 4.7 by 15.4 which 
yields a value of about .30 (4.7 ÷ 15.4 ≈ .30). Since both these values, .40 and 
.30, are expressed in the same units, namely standard deviations from the 
means of the two distributions, they may be compared directly. 
Therefore, we can say that John's score on the arithmetic test is slightly 
better than his score on the reading test. 

The calculation of these standard scores, which are sometimes called 
Z-scores, can be generalized in a formula: $Z\text{-score} = \frac{X - M}{\sigma}$, where X = 
actual score, M = mean, and σ = standard deviation. 

It will be readily seen that a score below the mean will yield a negative 
Z-score. For example, in reading, K’s score is 69. Substituting in the 
formula, we get $Z\text{-score} = \frac{69 - 75.3}{15.4} = \frac{-6.3}{15.4} = -.41$. From this value 
we know that K's reading score is .41 of a standard deviation below the 
mean. 

It is often inconvenient to work with decimals and negative values, as is 
necessary with Z-scores. These may both be eliminated by converting the 
Z-score to a score with an arbitrary value for the mean and sigma or stand- 
ard deviation. If we assume a mean of 50 and a standard deviation of 10 
this will be accomplished. It is necessary simply to multiply the Z-score 
by 10 and add the product to 50. Thus, John's Z-score in arithmetic is .40. 
Multiplying, .40 × 10 = 4. Adding this to 50 yields 54. In the example of 
pupil K in reading, we proceed thus: K's Z-score on reading is -.41. 
Multiplying, -.41 × 10 = -4.1. We drop the decimal from -4.1, which 
gives -4. Adding this (algebraically), 50 - 4 = 46. 

Now these values, 54 and 46, show that John is four-tenths of a standard 
deviation above the mean and K is four-tenths of a standard deviation below 
the mean. This is precisely what was said before, but our values, 54 and 
46, are whole, positive numbers. These scores are sometimes erroneously 
called T-scores. We shall refer to them simply as sigma scores to distinguish 
them from Z-scores. 
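
The two conversions can be sketched in Python as follows. The function names z_score and sigma_score are ours, and the rounding choices (two decimals for the Z-score, then dropping the decimal after multiplying by 10) are an attempt to follow the worked examples above rather than an established rule.

```python
import math


def z_score(score, mean, sigma):
    """Standard score (X - M) / sigma, rounded to two decimals as in the text."""
    return round((score - mean) / sigma, 2)


def sigma_score(z):
    """Rescale a Z-score to a mean of 50 and sigma of 10, dropping the decimal."""
    hundredths = round(z * 100)                 # e.g. -.41 becomes -41
    return 50 + math.trunc(hundredths / 10)     # -4.1 is cut to -4


john_z = z_score(36, mean=32.3, sigma=9.2)      # John, arithmetic
k_z = z_score(69, mean=75.3, sigma=15.4)        # pupil K, reading

john_sigma = sigma_score(john_z)
k_sigma = sigma_score(k_z)

print(john_z, john_sigma, k_z, k_sigma)
```

The resulting sigma scores are whole, positive numbers that still show on which side of the mean each pupil stands.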

There are two reasons for going into detail with respect to Z-scores and 
sigma scores. We have already said that these scores are the best that have 
been devised for comparison of results on tests by individual pupils. The 
second reason is that more and more of the makers and publishers of stand- 
ardized tests are expressing and interpreting results in terms of standard 
scores, usually in some form of what we have designated as sigma scores. 
Although not many teachers may have occasion to determine standard 
scores from their own tests, it is essential that all prospective users of stand- 
ardized tests understand standard scores. Such an understanding is the 
primary purpose of the explanation given here. 


Learning Exercises 


6. The same 24 ninth-grade pupils (see Learning Exercises 3, 4, and 5) 
made the following scores on a test of English fundamentals: 35, 31, 28, 33, 40, 24, 
21, 26, 29, 24, 31, 27, 32, 26, 25, 28, 22, 30, 26, 38, 30, 23, 18, 31. What is the mean 
of these scores? 

7. Compare the ranges of the two sets of scores. Which is larger? 

8. Calculate the semi-interquartile range of each set of scores. Suggestion: 
one-fourth of each series is six scores. Cut off the top six and lowest six, and find Q 
from the remainder. What do the two Q values tell you? 

9. For the two sets of scores assume the following: 


            WORD MEANING    ENGLISH
Mean            41.9          28.2
σ                8.7           5.1


Calculate (a) percentile ranks, (b) Z-scores, and (c) sigma scores for the following 
members of the class: 


             SCORE           SCORE
PUPIL     WORD MEANING      ENGLISH
  D            37              25
  J            60              40
  K            45              30

Compare these three pupils on each type of score. Which is simplest to calculate? 
Which type gives the most meaningful comparisons? Justify your answer. 


MEASURES OF CORRELATION OR RELATIONSHIP 


We often hear the term correlation used, or see it in print, in educational 
and psychological discussions and literature. Reference is made to a 
high correlation or a low correlation or to no correlation. If such phrases are 
to have meaning it is necessary that the term correlation, and use of high, 
low, or no with reference to this term, be given some meaning. 


Meaning of Correlation 

Correlation is a method of determining the degree of relationship between 
two traits or quantities that can change or vary in amount. For example, 
people vary in height, weight, intelligence, industry, and in countless other 
ways. Each of these qualities is a variable. If two of the variables, height 
and weight, are selected for study it is possible to say on the basis of obser- 
vation that there seems to be a degree of correspondence between them. 
That is, tall persons tend to be heavier, short persons tend to weigh less. 
The correspondence is not perfect because there are short stout and tall 
slender people, but in the main it can be said that there appears to be a defi- 
nite relationship between height and weight. Correlation is a mathemat- 
ical procedure for determining and expressing quantitatively the closeness 
of the relationship between two variables. The measure of this relationship
is called the coefficient of correlation. It ranges from a maximum of 
+1.00 through zero to a maximum of -1.00. If the two measured variables 
tend to vary together, as in the case of height and weight, the relationship 
is direct and positive, and approaches +1.00 as a maximum. If the relationship
is inverse and negative, that is, if an increase in one variable tends 
to be accompanied by a decrease in the second, the coefficient will be negative,
approaching -1.00 as a maximum. If there is no tendency for the 
traits or qualities to vary simultaneously either directly or inversely, the 
correlation is zero, or we say there is no correlation, or that there is absence 
of relationship. 

The correlation of +1.00 denotes a perfect positive relationship. This 
exists when a change in one trait is always accompanied by a commensurate 
change in the same direction in the other trait. Likewise, a correlation of 
-1.00 denotes a perfect negative correlation which means that a change in 
one trait is always accompanied by a commensurate change in the opposite 
direction in the other trait. Perfect correlations, either positive or negative,
are rarely found in educational measurements, and it may be added 
that negative correlations of any size are quite uncommon. 

An illustration of a positive correlation has already been given. The cor- 
relation between age of automobiles and their cash value would be negative 
since, as age increases, value will decrease. The correlation between age of 
school children and intelligence quotient would be zero since we will obviously
find no relationship between age and I.Q. in children. 

One of the most useful and easily understood devices for showing rela- 
tionship between two variables is called a scatter diagram. In such a dia- 
gram the individuals who are being measured are located or plotted with 
respect to both variables at the same time. This kind of chart may also be 
used as the basis for one of the best methods of determining relationship, 
the Pearson Product-Moment Correlation, or r. The details of calculating r 
from such an array are explained in Appendix A. 
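
For readers who wish to see the coefficient computed directly from paired scores rather than from a scatter diagram, a minimal Python sketch is given here. It uses the usual deviation form of the product-moment formula, which is not necessarily the computing routine of Appendix A; the function name pearson_r and the three small sets of paired values are hypothetical and serve only to show the extreme and zero cases.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical paired measures for four individuals.
x = [1, 2, 3, 4]
r_positive = pearson_r(x, [2, 4, 6, 8])    # y rises in step with x
r_negative = pearson_r(x, [8, 6, 4, 2])    # y falls in step with x
r_zero = pearson_r(x, [1, -1, -1, 1])      # no linear tendency either way

print(r_positive, r_negative, r_zero)
```

Real test data, such as the arithmetic and reading scores of John's class, would ordinarily yield a value somewhere between these extremes.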

To make clearer the nature of correlation and different values of the 
coefficient several scatter diagrams are shown below. In each case, there 
are two variables involved. In the first and second, the results are based on 
measurements of university students; in the third, on measurements of 
weather conditions and fuel. 

In Figure 1 the scatter diagram shows what would be called a moderate 
positive correlation. There is a distinct tendency for high scores on one 
test to be accompanied by high scores on the other, and vice versa. How- 
ever, some individuals who make high scores on the reasoning test make 
average or even low scores on the factual test, and the opposite is also true 
in some instances. If such exceptions did not occur, the correlation would 
be +1.00. This moderate positive correlation is quite typical of what is 
found when mental measurements are correlated. The correlations between 
various school subjects, and between I.Q. and achievement test results, tend 
to be in the general vicinity of .30 to .50. 


Figure 1 
Scatter Diagram Showing Positive Correlation 

[Scatter diagram: scores on factual test plotted against scores on reasoning test.] 

Each tally represents a person located according to his score on a 
factual test and his score on a reasoning test in educational 
measurement. r = .51 

Figure 2 shows the relationship between two other variables, in this 
case scores on a test of manual dexterity and scores on a vocabulary test. 
The correlation is practically zero, indicating that there is no evident 
relationship. In other words, knowing an individual's score on either test would 
provide no basis for estimating or predicting his score on the other. 


Figure 2 
Scatter Diagram Showing Negligible Correlation 

Each tally represents a person located according to his score on 
a test of manual dexterity and his score on a vocabulary test. 

Figure 3 shows a negative correlation of -.74 between average daily 
temperature and tons of coal burned per day, a fairly high degree of inverse 
relationship. The data are hypothetical and serve only to illustrate a 
negative or inverse relationship. Such a relationship is obvious in this case, since 
the higher the mean temperature for any given day, the lower would be the 
amount of coal burned for heating purposes. What the true correlation 
would be is not determined. It would, of course, be affected by other 
atmospheric conditions such as wind, humidity, and amount of sunshine, as well 
as other factors. 

Figure 3 
Scatter Diagram Showing Negative Correlation 

Each tally represents one day located according to average 
temperature and tons of coal burned in a community during the 
twenty-four hour period. r = -.74 

It should be clear that theoretically one may have high negative or even 
perfect negative correlations, just as high or perfect positive correlations 
are possible. In actuality, however, perfect correlations are rarely found 
in educational measurement. It should also be noted that a negative 
correlation designates just as close a relationship as its positive 
counterpart. A correlation of -.70 is just as high for predictive purposes as one 
of +.70. It is the size of the correlation, not its direction, that determines 
how close the relationship is. 


Further Ideas about Correlation 

The coefficient of correlation is an index or pure number which gives a 
measure of the degree of relationship between variables. Most commonly, 
only two traits or quantities are considered at one time, though there are 
methods of determining correlation between more than two. The correlation 
coefficient is not a percent. A correlation of .60 does not mean 60 per 
cent of perfect correlation. As has been said, it is simply a number or index 
which can vary from +1.00, through zero, to -1.00. We may repeat here 
for emphasis that the amount or size of the correlation coefficient expresses 
the degree to which two traits tend to vary simultaneously, or the extent to 
which increases (or decreases) in one tend to be accompanied by increases 
(or decreases) in the other. 

It may be appropriate to mention one or two additional cautions concern- 
ing correlation. The student should remember that a relationship between 
two variables does not prove them to be causally related. They may vary 
together or tend to do so because of a third factor that affects both. Age is 
frequently such a factor. For example, a substantial positive correlation 
might be found between height or weight and mental maturity among 
school children. Such a correlation would be the result of the effect of age 
on physical growth and mental growth, and not evidence that increase in 
height or weight causes mental growth. When the factor of age is eliminated 
or “held constant,” the correlation between height or weight and mental 
age is approximately zero. 
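
One common way of "holding constant" a third factor is the first-order partial correlation. The text does not give a formula, so the sketch below uses the standard textbook one as an assumption about what is meant. The function name and the three coefficients are hypothetical and were chosen so that the whole of the height and mental age relationship is accounted for by chronological age.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y with z held constant (first-order partial)."""
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("undefined when x or y correlates perfectly with z")
    return (r_xy - r_xz * r_yz) / denom


# Hypothetical coefficients for school children of mixed ages.
r_height_mental = 0.56  # height with mental age
r_height_age = 0.80     # height with chronological age
r_mental_age = 0.70     # mental age with chronological age

# With age held constant the apparent relationship all but vanishes.
print(round(partial_r(r_height_mental, r_height_age, r_mental_age), 2))
```

Here the observed .56 is exactly the product of the two correlations with age, so nothing remains once age is removed. When the third factor is unrelated to both traits, the partial coefficient equals the original one.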

It should also be clear that correlation can be determined only where 
there is some basis for relationship, as in the case of the same group being 
tested twice. There is no basis for correlation between two different groups 
of persons being tested even with the same test.² This fact seems 
self-evident, yet there is a common misconception that a coefficient of correlation 
may be determined from test scores of two different classes or groups. 

There are many kinds of correlation coefficients, each with particular and 
specific uses, but the most common and simplest is the one we have discussed 
here, known as a linear correlation between two variables. There are also 
other methods of calculating linear correlation besides the product-moment 
method. However, the basic concepts are the same, no matter which method 
is used. The product-moment method and one other method are described 
and illustrated in Appendix A. 


Size of Correlation Coefficients 


A question of concern to everyone using correlations has to do with the 
size of the coefficient. Since the coefficient can vary in either direction from 
zero to 1.00, it is important that we try to determine what is a high 
correlation, a moderate one, a low one. No simple and rigid rules or answers to 
such questions can be given. A correlation may be high in one situation 
and only moderate or even low in another. Correlations between two 
equivalent forms of a test of achievement or intelligence may be .90 or higher, 
sometimes as high as .98. On the other hand, correlations between I.Q. and 
school marks will ordinarily not be higher than .50. Correlations between 
physical traits or abilities such as rate of tapping or strength of grip, and 
mental tests, will usually be not far from zero. Therefore, the 
interpretation of a coefficient of correlation depends on the situation. 

2 The only common exceptions are to be found in the study of twins' or other relatives' 
resemblance in specific traits, or in educational experiments where individuals in one 
group are paired or matched with individuals in another group on some basis such as 
intelligence. 

Perhaps a more useful method of interpreting the size of the correlation 
coefficient is one which involves forecasting or predictive value. For 
example, if we know a pupil's level of intelligence, how accurately can we 
predict his achievement in algebra? Knowledge of a given pupil's I.Q. score 
will never tell us exactly what his score will be on an algebra test. But if 
both tests have been previously administered to the same group of pupils, 
then the size of the correlation coefficient between the results of the two 
tests will give us some idea of how confident we can be later in predicting an 
algebra score from the score on the intelligence test. If the two tests prove 
to have a correlation of .90 or higher, then we would be fairly safe in pre- 
dicting that a given individual will have roughly the same score on the 
algebra test as on the intelligence test. For individual prediction, the corre- 
lation must be very high — at least as high as .90. 

On the other hand, if we wish simply to have some assurance of being 
right more often than wrong in group prediction, then correlations of .50 or 
even less are often quite useful. In such cases it is also possible to say with 
considerable assurance that one group will do better or worse than another. 

Another problem in prediction is this: if we measure a group of children 
on a test today, how closely will the results agree with those obtained 
from the same test given to the same children several days later? Here, 
where we are interested in knowing how consistent results of a test are, 
correlations of .80 or better are usually considered necessary to give assur- 
ance of acceptable stability or consistency. 

These matters will be considered further in the discussion of reliability 
and validity in the next chapter, and again in Appendix A. 


• Learning Exercises • 


10. Cite three illustrations each of (a) positive correlation, (b) approximately zero 
correlation, (c) negative correlation. 

11. Would you expect positive, negative or no correlation in each of the following 
comparisons? 


(a) I.Q. and marks in algebra 
(b) Speed and accuracy in addition 
(c) Scores on two equivalent forms of an achievement test 
(d) Age and I.Q. within a school grade 
(e) Age and I.Q. over a range of grades 
(f) Cost of a product and the supply 


12. Make a scatter diagram of the scores on arithmetic and reading given for the 
first group in this chapter. (See pages 35 and 42.) Suggestion: write down the 
pupils’ letter designations and then in parallel columns each one’s score on arith- 
metic and reading. From this arrangement make a plot locating each pupil with 
respect to both scores simultaneously. Comparing the result with the three dia- 
grams given, what do you estimate the correlation to be? 

13. By reference to Appendix A you can calculate the correlation between reading 
and arithmetic scores. Does the result agree with your estimate? 

14. Knowing the size of the coefficient of correlation in this case, what can you 
say about predicting reading scores from arithmetic scores, and vice versa? 


QUOTIENTS AND NORMS 


The point has already been made that a single score on a given test is 
merely a number and has no meaning in and of itself. Such a score, called 
a raw score, must be related to something to give it meaning. We have al- 
ready discussed several kinds of relative scores: ranks, percentile ranks, and 
standard scores. However, there are a number of other kinds of relative 
scores, derived scores, or transmuted scores, two of which should be familiar 
to every teacher and counselor. These are quotients and norms. Both will 
be briefly defined and discussed here, and more detailed discussion of each 
will be found at appropriate points throughout the remainder of this book. 


Intelligence Quotient 


The Intelligence Quotient, or I.Q., has become a familiar term to the layman 
as well as to professional educators and psychologists. The term is fre- 
quently found even in newspapers and popular magazines. It is doubtful 
that the average person could give a precise definition of I.Q., however; 
most people know that it is supposed to be some indication of degree of in- 
telligence, but that is as far as their knowledge goes. 

The I.Q. or intelligence quotient is a ratio of mental age to chronological 
age or life age. The equation is written 

$$\text{I.Q.} = \frac{\text{M.A.}}{\text{C.A.}}$$ 

This theory assumes that the individual grows or develops mentally at a 
steady rate from birth to a few years before maturity, and that different 
individuals may develop mentally at the same or different rates. There are 
other assumptions of a more technical nature which will be discussed later 
in connection with the material on intelligence tests, but for present 
purposes the foregoing will suffice. 

The rate at which an individual develops mentally is measured by 
determining first the level of difficulty of tasks he can perform. For example, 
if A can do as much (that is, tasks as difficult) as the average child of 12, he 
is said to have a mental age or level of mental development of 12 years. 
Thus, if A is only 10 years old, it is obvious that he is developing at a rate 
which is faster than average, since he has reached the level of mental 
maturity in 10 years which the average child requires 12 years to attain. 
The rate of his mental development is then determined by comparing his 
mental level with his chronological or life age. So we say his I.Q. is 12 years 
divided by 10 years: 

$$\text{I.Q.} = \frac{\text{M.A.}}{\text{C.A.}} = \frac{12}{10} = 1.2$$ 

It is common practice to multiply the result by 100 to eliminate the 
decimal (move the decimal point two places to the right), and say his I.Q. 
equals 120. To express the result in terms of the rate of mental 
development, we say that for every year A has lived he has added or grown 1.2 
years mentally. 
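
The computation just described can be sketched in a few lines of Python. The function names are hypothetical, and the choice to express both ages in months and to round to the nearest whole number is an assumption made for convenience, not a rule stated in the text.

```python
def to_months(years, months=0):
    """Express an age given in years and months as a number of months."""
    return years * 12 + months


def intelligence_quotient(mental_age_months, chronological_age_months):
    """Ratio I.Q.: mental age over chronological age, multiplied by 100."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return round(100 * mental_age_months / chronological_age_months)


# Pupil A from the text: mental age 12 years, chronological age 10 years.
print(intelligence_quotient(to_months(12), to_months(10)))  # 120
```

A child whose mental age equals his chronological age obtains 100 by the same function, which is why 100 marks average development.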

It is generally assumed that mental growth or development as measured 
by tests ceases on the average somewhere between 16 and 18, just as physi- 
cal growth ceases at a slightly later age. Consequently, A does not continue 
to add 1.2 years indefinitely; instead, it is believed that this rate begins to 
decline at about 12 to 14 years, and beyond 16 to 18 the average scores on 
intelligence tests do not continue to increase. From then on, the I.Q. is less 
useful than some other measures for indicating the degree of intelligence. 
However, there is some evidence indicating that while the rate of mental 
growth declines and finally becomes zero as stated, it may not cease entirely 
or reach its maximum until age 20. The question is controversial and 
different authorities disagree on the exact age, but there seems to be rather 
general agreement that the average maximum growth is reached somewhere 
between ages 16 and 20. It should also be noted that there is a wide range 
in the age at which mental maturity is reached. Of course some individuals 
reach maturity much earlier than others, but our figures here pertain to 
averages for the general population. 

Theoretically, the range of I.Q.'s is from 0 to 200 or more. Since the 
norm or standard is the level of development of the average child of a given 
age, it follows that the average I.Q. must always be 100 so long as these 
norms or standards are kept up to date. Average I.Q.'s fall in the range from 
80 to 120; those from 120 to 140 are considered superior, and those above 140 
are indicative of potential genius. I.Q.'s of 80 to 90 are classified as dull and 
those from 70 to 80 as borderline defective. People whose I.Q.'s fall between 
50 and 70 are generally classed as morons, between 25 and 50 as imbeciles, 
and below 25 as idiots. These limits are arbitrary and do not necessarily 
apply in every case, but they represent approximate classifications that have 
rather general acceptance. 

The concept of the I.Q. has been a very useful one, although it has always 
been more or less controversial. For various reasons it has fallen into 
disfavor with some measurement experts, counselors, and psychologists, and 
there is a tendency to replace the I.Q. concept with some other measure such 
as standard scores. Although this change would eliminate a number of 
vexing problems, it is not likely to come about for some time since the I.Q. 
is so well-established; standard scores as yet are not nearly so well-known 
or generally understood. 

It should be remembered in using I.Q.'s that any single test result is only 
an approximation. Every measurement, even in such exact sciences as 
physics and astronomy, is always subject to some error, and this fact must 
always be recognized in measuring human traits, particularly mental or 
psychological ones. Therefore, an I.Q. based on a single measurement or 
test should be regarded as only an approximate value which must remain 
tentative until it is substantiated or corrected through additional testing. 
A good rule to follow is that when we have an I.Q. based on a single test we 
should consider that the chances are only fifty-fifty that the true value is 
within a range of about five points on either side of the obtained value. In 
other words, a measured I.Q. of 108 represents a probability of fifty per cent 
that the true value is somewhere between 103 and 113. Moreover, there is a 
twenty-five per cent chance that the true I.Q. is more than 113 and an equal 
probability that it is less than 103. This should certainly not be taken to 
mean that a single measurement of a child's I.Q. is of no value, but it does 
mean that such an I.Q. must be regarded as an approximation which further 
testing and other facts about the child may or may not support and confirm. 


Other Quotients 


Besides the I.Q. several other quotients are used in educational 
measurements. Some of the more common ones are: 


(1) Educational Quotient, abbreviated E.Q. 

$$\text{E.Q.} = \frac{\text{E.A.}}{\text{C.A.}} \times 100$$ 

Educational Age (E.A.) is the age which corresponds to the score made 
by a pupil on an achievement test or battery. For example, if the score 
made by the individual pupil is equal to the average score on the test 
of pupils 9 years and 6 months of age, his educational age on that 
test is 9½ years. If his chronological age is 8 years and 3 months his 

$$\text{E.Q.} = \frac{9\text{-}6}{8\text{-}3} \times 100 = \frac{114 \text{ (months)}}{99 \text{ (months)}} \times 100 = 115.$$ 

The educational quotient is usually based on a survey test or battery rather 
than a single test like a reading test or an arithmetic test. In the latter 
cases, the quotients would more properly be called the Reading Quotient or 
the Arithmetic Quotient. 

$$\text{Reading Quotient} = \frac{\text{Reading Age}}{\text{Chronological Age}} \times 100$$ 

$$\text{Arithmetic Quotient} = \frac{\text{Arithmetic Age}}{\text{Chronological Age}} \times 100$$ 
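
Since the educational, reading, and arithmetic quotients all have the same form, one small routine serves for each. The following Python sketch is illustrative only. The function names and the use of (years, months) pairs are assumptions, as is rounding to the nearest whole number.

```python
def months(age):
    """Convert a (years, months) pair such as (9, 6) to months."""
    years, extra_months = age
    return years * 12 + extra_months


def age_quotient(test_age, chronological_age):
    """Any 'age over chronological age' quotient, multiplied by 100.

    test_age may be an educational, reading, or arithmetic age.
    """
    return round(100 * months(test_age) / months(chronological_age))


# The worked example from the text: E.A. 9-6, C.A. 8-3.
print(age_quotient((9, 6), (8, 3)))  # 114 months / 99 months x 100 = 115
```

Passing a reading age or an arithmetic age in place of the educational age gives the Reading Quotient or the Arithmetic Quotient.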


One of the obvious limitations of the educational quotient arises from the 
fact that it is based on a pupil's average performance on a battery of achieve- 
ment tests. His scores on the separate subjects might be distinctly above 
the norm in some and even below in others. Though the pupil's educational 
age would reveal the central tendency, it would not reveal these variations 
which are perhaps the most significant facts about his achievement. 


(2) Achievement Quotient, abbreviated A.Q. 

In the achievement quotient we see an attempt to relate accomplishment to 
ability. It is the ratio of two quotients, the educational quotient to the 
intelligence quotient. It suggests a device for determining to what extent 
the individual is achieving up to his capacity to achieve. 

Now, by ordinary arithmetic the expression for the A.Q. can be simplified, 
thus: 

$$\text{A.Q.} = \frac{\text{E.Q.}}{\text{I.Q.}} \times 100, \qquad \text{E.Q.} = \frac{\text{E.A.}}{\text{C.A.}}, \qquad \text{I.Q.} = \frac{\text{M.A.}}{\text{C.A.}}$$ 

(The two quotients are written here as simple ratios; the factor of 100 
that each ordinarily carries cancels in the division.) 

Substituting, 

$$\text{A.Q.} = \frac{\text{E.A.}/\text{C.A.}}{\text{M.A.}/\text{C.A.}} \times 100,$$ 

or, 

$$\text{A.Q.} = \frac{\text{E.A.}}{\text{C.A.}} \times \frac{\text{C.A.}}{\text{M.A.}} \times 100 = \frac{\text{E.A.}}{\text{M.A.}} \times 100.$$ 

Therefore, the A.Q. reduces to a ratio between E.A., an individual's level 
of achievement, and M.A., his level of mental development. Theoretically, 
at least, no one should have an achievement quotient greater than 100. 
That is, no one would ever achieve more than he had the capacity to 
achieve provided our measures of capacity were perfect. Actually, however, 
achievement quotients above 100 are common. There are other aspects of 
the achievement quotient technique that make it somewhat questionable; 
it is presented here primarily for informational purposes. We may say that 
generally speaking none of the quotients mentioned in this section has the 
general acceptance or wide use of the I.Q. 
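
The reduction of the A.Q. to a ratio of educational age to mental age can be checked numerically. The short Python sketch below does so for one hypothetical pupil. The function names and the three ages are invented for illustration and do not come from the text.

```python
import math


def achievement_quotient(educational_age_months, mental_age_months):
    """A.Q. in its reduced form: E.A. over M.A., multiplied by 100."""
    return 100 * educational_age_months / mental_age_months


def achievement_quotient_from_quotients(eq, iq):
    """A.Q. as first defined: E.Q. over I.Q., multiplied by 100."""
    return 100 * eq / iq


# Hypothetical pupil: C.A. 10-0, M.A. 12-0, E.A. 11-0, all in months.
ca, ma, ea = 120, 144, 132
eq = 100 * ea / ca  # educational quotient, 110
iq = 100 * ma / ca  # intelligence quotient, 120

direct = achievement_quotient(ea, ma)
via_quotients = achievement_quotient_from_quotients(eq, iq)
print(round(direct, 1), round(via_quotients, 1))
assert math.isclose(direct, via_quotients)
```

Both routes give the same figure because chronological age cancels out of the ratio. This pupil is ahead of his age group in achievement (E.Q. above 100) and yet has an A.Q. below 100, since his mental age is further ahead still.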


Norms 

A norm is the average performance on a test by a defined group, e.g., a 
given age or grade. If the norms are based on groups of children of the same 
age they are called age norms; if based on groups in the same grade they are 
called grade norms. As a rule, intelligence tests have age norms while 
achievement tests have grade norms. It is not difficult to see the reason 
for this. In calculating a child's I.Q. we need to know his mental age and 
his chronological age. Mental age is determined by the average 
performance of children of given ages. Thus, if it is found that the average score 
on an intelligence test of children exactly ten years old is 42, then a child 
who makes a score of 42 on this test is said to have a mental age of ten 
years since he can perform on this test as capably as the average ten-year-old. 
Reproduced in Table III is a table of norms for a well-known intelligence test. 


Table III 

Mental Ages Corresponding to Scores on Otis Quick-Scoring 
Mental Ability Test: Beta, Forms A and B 

(Reproduced by permission of the World Book Company, publisher.) 
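
Reading a mental age from a table of age norms amounts to a simple lookup. The Python sketch below illustrates the idea with invented norms: only the pairing of a score of 42 with age ten comes from the example in the text, and every other entry is hypothetical. It does not reproduce the Otis table. Interpolating between rows is likewise an assumption, since a published table would ordinarily list every score.

```python
# Hypothetical norms: (average raw score, age in months at which it is earned).
# Only the pair (42, 120) reflects the example in the text.
HYPOTHETICAL_NORMS = [(36, 108), (39, 114), (42, 120), (45, 126), (48, 132)]


def mental_age_months(score, norms=HYPOTHETICAL_NORMS):
    """Mental age in months for a raw score, interpolating between rows."""
    if score < norms[0][0] or score > norms[-1][0]:
        raise ValueError("score falls outside the table of norms")
    for (s_lo, a_lo), (s_hi, a_hi) in zip(norms, norms[1:]):
        if s_lo <= score <= s_hi:
            return a_lo + (score - s_lo) * (a_hi - a_lo) / (s_hi - s_lo)


def format_age(age_months):
    """Write an age in months in the years-months form used in norm tables."""
    years, rem = divmod(round(age_months), 12)
    return f"{years}-{rem}"


print(format_age(mental_age_months(42)))  # 10-0, the ten-year-old average
```

A score between two tabled values, such as 40 in this invented table, is assigned an age between the neighboring rows.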

For achievement tests, the type of norm commonly used is the grade 
norm. Here we are interested in knowing the average performance of chil- 
dren at certain grade levels. For example, in establishing norms on a test 
of achievement in American history we may wish to set up norms for grades 
four to eight, inclusive. The simplest approach to the problem would be to 
determine the average score on the test for representative samples or groups 
at each grade level. 

Norms are not always based on or given in terms of raw scores. Both 
percentiles and standard scores may be used in tables of norms, as can be 
seen in Table V, page 58. 


Table IV 

Grade and Age Norms for Gates Basic Reading Tests, Type A 

Number of Paragraphs Correct | Reading Grade | Reading Age 

[Table entries are only partly legible. Reading ages that can be read: 
8-2, 8-3, 10-4, 10-8, 11-1, 11-5, 11-10, 12-2, 12-5, 12-10, 13-3, 13-9, 
14-3, 14-9.] 

Reading Grades are in Years and Tenths, Reading Ages in Years and Months. 
(Reproduced by permission of Teachers College, Columbia University, publisher.) 


Table V 

Scaled Score (Standard Score) and Percentile Norms for Cooperative 
English Test, Mechanics of Expression, in Public Secondary 
Schools of the East, Middle West, and West 

Mean 
Standard Deviation 8.3 8.6 8.7 8.9 

(From Cooperative English Test, by Janet Afflerbach and others. Copyright … 
Testing Service. Reprinted by special permission of Educational 
Testing Service.) 


It will be noted in Tables III and IV above that both age and grade 
norms are classified in terms of units smaller than a whole year. Many 
tables of norms use one month as the unit, both in age and grade. In age 
norms the divisions include the entire 12 months of the year, while in 
grade norms the range is usually from the exact year, such as 9.0, to 9.10, 
representing the last month of the school year. Thus, in mental age norms 
one finds values like 7-10 or 9-11; these represent 7 years 10 months, and 
9 years 11 months, respectively. In tables of grade norms one might find 
5.4 or 7.7, representing the fourth month of the fifth grade, and the seventh 
month of the seventh grade, respectively. 
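
The two notations can be read mechanically, as the Python sketch below suggests. The function names are hypothetical. Grade norms are deliberately handled as text rather than as decimal numbers, on the assumption that the figure after the point counts months of the school year: read as a decimal, 9.10 would be indistinguishable from 9.1.

```python
def parse_age_norm(text):
    """Read an age norm such as '7-10' as (years, months)."""
    years, months = (int(part) for part in text.split("-"))
    if not 0 <= months <= 11:
        raise ValueError("months in an age norm run from 0 to 11")
    return years, months


def parse_grade_norm(text):
    """Read a grade norm such as '5.4' as (grade, month of the school year)."""
    grade, month = (int(part) for part in text.split("."))
    return grade, month


print(parse_age_norm("7-10"))    # (7, 10): 7 years 10 months
print(parse_grade_norm("5.4"))   # (5, 4): fourth month of the fifth grade
print(parse_grade_norm("9.10"))  # (9, 10): tenth month of the ninth grade
```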

It should also be pointed out that in the case of the Cooperative English 
Test an intermediate step between scoring the test and using the table of 
norms is necessary. After the test is scored the raw score is transmuted to 
a scaled score by use of a conversion table on the answer sheet. These 
scaled scores are standard scores such as were described earlier in this 
chapter. In a sense they are norms since they reflect the status of an individual 
score in terms of the mean and standard deviation of a normative 
population. These are converted in the table of norms to percentiles, principally 
because the latter are more easily and widely understood by test users than 
standard scores are. 

The subject of test norms includes many important and complex prob- 
lems which have not even been mentioned here for the reason that they are 
considered to be beyond the scope of this book. If the student has grasped 
what has been presented on averages, percentiles, standard scores, age 
norms, and grade norms, he will have a foundation of basic essentials under- 
lying practically all types of test norms he is likely to encounter in his work. 

The purpose of this chapter has been to present the elementary concepts 
and tools for interpretation of test scores as simply as possible. With these, 
any teacher should find it possible to interpret scores on tests of achievement 
and intelligence for his own pupils and classes. For those who desire it, a 
somewhat more detailed and technical presentation of elementary statistical 
methods will be found in Appendix A. 


• Learning Exercises • 


15. Using the table of norms for the Otis Beta Test (page 56), determine mental 
ages for the following scores: 12, 30, 49, 55, and 61. If the ages of these individuals 
are respectively 6-9, 11-8, 14-0, 14-6, and 15-5, what are their I.Q.'s? 

16. Referring to the table of norms for the Gates Reading Test (page 57), (a) what 
reading grades and reading ages would correspond to scores of 9, 16 and 23? 
(b) Reading grade 3.0 corresponds to reading age 8-6; reading grade 4.0 corresponds 
to reading age 9-8. How do you explain this apparent discrepancy? 

17. The table of norms for the Cooperative English Test, Mechanics of Expression 
(page 58), shows scaled scores and percentile norms for Grades 7 through 12. By 
reference to the table, answer the following questions: 

(a) What is the percentile rank in ninth grade of a scaled score of 50? In the 
seventh grade? In the twelfth? 

(b) What scaled score does an eighth-grade pupil have to make in order to receive 
a percentile rank of 72? What scaled score is required in the tenth grade to 
achieve the same rank? 

(c) Assume that the mean of the scaled scores is 50. In what grade is the average 
achievement approximately equal to this? 


Finding and Selecting 
Good Measuring Instruments 


The title of this chapter poses a problem of concern to all people who want 
to use a test¹ for a specific purpose or in a certain situation. What is a good 
test of intelligence for fifth-grade pupils? Is there a reliable test for measur- 
ing interests of high school youngsters? How can I find out what tests 
and other measuring devices are available? How can I tell whether or 
not a measuring instrument is suitable for my purpose? These and similar 
questions confront teachers, counselors, and administrators who are re- 
sponsible for measurement in the schools. Unfortunately, the answers are 
too often based on hearsay or the opinion of an individual who may have 
little actual knowledge on which to base his statements. Undoubtedly a test 
is often chosen, purchased, and used on no better basis than the casual word 
of a fellow teacher, principal, or counselor. When a test is selected in this 
manner, the chances are that it will not prove to be entirely appropriate 
and satisfactory. In most cases, if the test is not successful, there is a danger 
that the test itself may be condemned when actually its failure to do the job 
properly is the result of its misuse. 

How then does one go about the intelligent selection of a test for a specific 
purpose? This question really should be considered in two major divisions: 
(a) sources of information about available measuring instruments, and 
(b) characteristics of a good measuring instrument. 

Each of these topics will be considered in some detail in this chapter, al- 
though the main part of the discussion will be concerned with (b), the 
characteristics of a good measuring instrument. 

1 Although the discussion in this chapter deals largely with tests, it should be 
understood that the principles presented apply to a considerable extent to measuring 
instruments and devices of all types. 


SOURCES OF INFORMATION 


Publishers’ Catalogs 
For a first or preliminary survey of the available tests for any purpose, 
publishers' catalogs are invaluable. Most catalogs supply enough 
information about each test to enable the reader to decide whether he wishes to 
examine it. All publishers make available to authorized persons a specimen 
or sample set for a modest sum. This set usually includes a copy of the 
test, a manual of directions, and a scoring key. By the purchase of such a 
set a prospective user can acquaint himself thoroughly with the test and 
the methods of use. All the publishers listed in Appendix B are reputable 
firms which will gladly send catalogs or lists of tests upon request. When 
corresponding with publishers about tests or test literature, one should 
write on school stationery, stating his official connection or position, because 
publishing firms do not usually send such materials to unauthorized persons. 

The catalogs and other advertising material distributed by the test 
publishers generally contain fairly objective and reliable information about each 
test. In addition to information such as purpose for which the test is 
intended, cost, number of equivalent forms, grade level for which the test is 
appropriate, and the type of norms available, the catalogs often contain 
statements concerning the construction of the test, authorship, etc. Most 
test publishers are restrained from making false or exaggerated claims for 
a test by the very keen competition in this field and by the certainty that a 
test must prove itself to be good during a period of several years if it is to 
show a profit. Critical reviews by test specialists who are often in a position 
to recommend tests constitute another factor which leads publishers to be 
conservative in advancing claims for the quality of their tests. 

The sample entry on page 62, taken verbatim from the catalog of a 
publisher, will serve to illustrate the type of statements usually found in test 
publishers' catalogs. 


------------------------------------------------------------ 

Anderson Chemistry Test 
By Kenneth E. Anderson, School of Education, University of Kansas. 

A measure of the extent to which students have achieved the important 
objectives of a high school course in chemistry. Each of two comparable 
forms contains 80 test items, selected on the basis of curricular validity 
and satisfaction of statistical requirements. 

Each form is divided into the following parts: Understanding of facts and 
concepts; Understanding and application of functional principles; 
Understanding and application of the elements of the scientific method; 
Ability to use the basic skills in chemistry. 

Topics included are: chemical changes; solutions, elements, compounds; 
symbols, equations, problems; atomic structure; ionization; organic 
chemistry; applications. 

Range. End-of-course test. 
Working time. 40 minutes. 
Scoring. With perforated stencil key by hand or machine. 
Norms. Percentile norms. 

[Price list: Test Booklet, Answer Sheet, and Expectancy Chart, priced per 35, 
with accessories and specimen set; the only legible figure is $0.02.] 

An Answer Sheet is required for each student tested. Test package contains Manual of 
Directions and Key. Class Record is included with each 35 Answer Sheets. 

------------------------------------------------------------ 


Test Manuals 
Another important source of information concerning a particular test is, 
of course, the manual accompanying it. Test manuals usually contain the 
following information: 

(a) Purposes of the test 

(b) Bases for selection of content 

(c) Organization of the test 

(d) Directions for administering and scoring 

(e) Norms 

(f) Suggestions for interpreting scores 

(g) Suggested ways of using results of the test 

Once the prospective user of a test has decided upon one or several tests
which he wishes to consider before making a final choice, he should order
a specimen or sample set from the publisher. The specimen set, which usu- 
ally includes a copy of the test, the manual, a scoring key, and frequently 
other accessories such as a class record sheet, gives the most complete avail- 
able data about the test, and generally costs less than fifty cents. A speci- 
men set usually enables the prospective test user to gain enough knowledge 
of the test under consideration to form an adequate basis for a decision. 
Catalogs and manuals are important sources of information about tests, but 
it must be recognized that there is wide variation in the care with which 
they are prepared and the completeness of information provided. If the 
catalog description and the manual are both inadequate the prospective 
user must consult other sources. 


Other Publications of Test Publishers 


Some test publishers distribute to school personnel series of short articles
dealing with practical problems of testing. A sample list of these articles
follows:


CALIFORNIA TEST BUREAU

Educational Bulletin No. 10: Diagnosis in the Reading Program. Ernest W.
Tiegs.

Educational Bulletin No. 14: The Proper Use of Intelligence Tests. Ernest W.
Tiegs. 

Educational Bulletin No. 16: Guiding Child and Adolescent Development in 
the Modern School. Louis P. Thorpe. 


HOUGHTON MIFFLIN COMPANY

Test Service Bulletin No. 10: Tests and Your Testing Program. Granville B. 
Johnson, Jr. 
Test Service Bulletin No. 13: Nature and Purposes of the Iowa Tests of Basic 
Skills. E. F. Lindquist and A. N. Hieronymus. 


Test Service Bulletin No. 14: Initiating a Testing Program. Frank B. Womer. 


PSYCHOLOGICAL CORPORATION 
Test Service Bulletin No. 36: What Is an Aptitude? Alexander G. Wesman. 


Test Service Bulletin No. 37: How Effective Are Your Tests? Jerome E. 
Doppelt and Harold G. Seashore. 


Test Service Bulletin No. 42: Does Testing Cost Too Much? George K. 
Bennett. 


WORLD BOOK COMPANY


Test Service Bulletin No. 65: Identifying Reading Difficulties. Gertrude H. 
Hildreth and B. C. Wadell. 


Test Service Bulletin No. 71: Retention and Forgetting During Summer 
Vacation. Helen P. Seward. 


Test Service Bulletin No. 75: Curricular and Instructional Implications of 
Test Results. James D. Leake and Walter N. Durost. 


Such articles apply the results of test research to the problems of 
teacher, counselor, or administrator. Copies of the articles are usually 
free and, although primarily intended to help the sales of the respective 
publishing houses, they are worth the thoughtful consideration of every 
person interested in problems of measurement. 


University Research and Service Bureaus 


Most universities have an office or bureau to which prospective users of 
tests may turn for advice and help. The members of the staff of such an 
organization are frequently able to lend copies of available tests from their 
files and give advice on the strengths and weaknesses of various tests. They 
are usually willing to meet with and help committees of school personnel 
authorized to choose tests for particular purposes. Such bureaus nearly 
always have files of test publishers’ catalogs and specimen sets or copies of 
many published standardized tests. The persons in charge of these organi-
zations are generally experienced test technicians and supervisors who can
help inexperienced school people avoid mistakes and save time and money. 

However, one should not place too much reliance upon the help of such 
individuals, expert though they may be. Ideally, every teacher should be 
qualified to appraise tests objectively in the light of his own situation, and, 
in the final analysis, be competent to select his own tests. 

The persons connected with test and research bureaus, the instructors 
in courses in measurement, and even representatives of test publishers, 
whose job it is to help people select the best tests for the purpose at hand, 
will seldom try to force the choice of a particular test on anyone. These 
specialists will usually follow the more ethical procedure of making available 
several different tests with information about each, permitting the pro- 
spective user to choose the one best suited to the needs of his own situa- 
tion. 


Mental Measurements Yearbook and other Periodical Reviews 


The meticulous user of tests will often be unsatisfied with the information 
obtained from the sources already mentioned. Having examined catalogs 
and sample sets of tests, and talked with colleagues and perhaps even with 
an expert, he may still be undecided and want more facts. The best and 
most complete source of evaluative data on tests is the Mental Measurements 
Yearbook,² new editions of which are published at frequent intervals. The
Yearbooks have earned for themselves a unique position in measurement 
literature. Each edition contains impartial, critical reviews of the majority 
of available standardized tests. Moreover, the more widely used tests are 
reviewed in successive editions of the Yearbook to show how the tests com- 
pare over a period of years. No equally objective, reliable, and comprehen- 
sive source of information and expert opinion concerning published tests is 
available. The work of Oscar K. Buros and his staff is invaluable to all 
users of tests printed in the English language. 

Certain journals, such as Educational and Psychological Measurement,
publish reviews which, generally speaking, are more useful to the test
specialist or research worker than to a classroom teacher or guidance coun-
selor. However, when the prospective user of a test is undecided, the re-
views found in such periodicals and those in one or another edition of the
Mental Measurements Yearbook may help materially in reaching a decision.
It is much easier to find commentary on a particular test in the Yearbooks
since the reviews are contained in one volume. In the periodicals, of course,
the various reviews appear in various issues.


² Oscar K. Buros (ed.), (1) The Nineteen Thirty-Eight Mental Measurements Yearbook
(New Brunswick, N.J.: Rutgers University Press, 1938); (2) The Nineteen Forty Mental
Measurements Yearbook (Highland Park, N.J.: The Gryphon Press, 1941); (3) The Third
Mental Measurements Yearbook (New Brunswick, N.J.: Rutgers University Press, 1949);
and (4) The Fourth Mental Measurements Yearbook (Highland Park, N.J.: The Gryphon
Press, 1953).

Having consulted all or most of the above described sources of informa- 
tion about tests, the prospective user should be in a favorable position to 
make a choice. He has first found out what is available to meet his needs; 
he has then gathered all the information he can from different sources about 
the tests which seem to hold possibilities for his purposes; he has related all 
this information to the situation in which the test is to function and the 
purposes for which it is to be used. In taking these steps, the prospective 
user has done everything possible, short of actually using the test, to insure 
success and efficiency in accomplishing his purposes. 


Learning Exercises


1. Examine several catalogs of test publishers. How much information does each 
give concerning the criteria discussed in this chapter? 

2. Ask your instructor for a manual or, better still, a specimen set of a standard- 
ized test. Examine the manual or specimen set for information concerning the 
criteria discussed in this chapter. In what respects do you consider it adequate? 
In what respects (if any) do you consider it inadequate? 

3. Read the reviews of a particular standardized test (perhaps the one used in 
Exercise No. 2, above) in the latest edition of the Mental Measurements Yearbook. 
Write a short evaluation of the test, basing your evaluation on the Yearbook reviews. 


CHARACTERISTICS OF A GOOD MEASURING INSTRUMENT 


When a person is faced with the responsibility of choosing among two or 
three tests, all of which are available from reputable sources, how does he 
proceed to select one of them for use? For the sake of this discussion we 
shall assume that the tests under consideration appear to be equally suited
to local conditions and that the strengths and weaknesses of the tests are 
fairly well balanced as far as the obvious and non-technical features are 
concerned. What, then, are the basic criteria of a more technical nature that 
may be used as guides in the selection of a test or other measuring device? 

All good measuring instruments have certain primary qualities in com- 
mon. These are the universals — the qualities which differentiate good 
tests from inferior ones — whether they be for use of the educator, the 
psychologist, the medical technician, the physicist, or people in other fields. 
A test which lacks a known and substantial degree of these primary qualities 
is not a measuring instrument in any true sense, and little or no dependence 
can be placed upon results obtained by its use. The two universals gener-
ally agreed upon are reliability and validity.

Besides these two universal requirements for a good test, whatever the
field, there are certain secondary characteristics which are desirable in all 
good educational and psychological tests: objectivity, ease of administration, 
ease of scoring, and ease of interpretation. These are far less crucial than 
reliability and validity, since a test may function efficiently without the 
presence of the secondary characteristics as long as it is valid and reliable. 
However, the secondary qualities to some extent affect validity and reliabil- 
ity and in any event make the use of a test much simpler. 

In addition to these secondary characteristics which are valuable in all 
educational and psychological tests, there are certain other attributes which 
are present in good standardized tests and which distinguish such tests from 
the informal classroom or teacher-made tests. These attributes or criteria
are adequate norms, equivalent forms, and economy. They are important for
any good standardized test, though seldom applicable to unstandardized or 
teacher-made tests. 

The rest of this chapter is devoted to a discussion of each of the above- 
mentioned characteristics of measuring instruments. 


Reliability 

In discussions of the two universal criteria of a good test it is customary 
to think of validity as the most important quality and to discuss it first. 
However, since reliability is essential to validity and the opposite is not so, 
there is something to be said for placing reliability at the head of the list. 
A test may be reliable without being valid, whereas the validity of a test
depends in part on its reliability; therefore, a test is only as valid as it is reliable.
Reliability refers to the consistency with which a test measures. The 
meaning of consistency in a test we shall clarify with a few illustrations. It 
was observed long ago that when an individual measured the diameter of a 
very accurately turned steel ball several times with an exceedingly accurate 
pair of calipers, he did not get exactly the same result every time. Even
with the most accurate instruments available and the best possible control 
of conditions, the successive measurements of the diameter of the steel ball 
always varied somewhat. The extent of such variations is a measure of the 
consistency, or the lack of it, in this measuring situation. 
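
The steel-ball illustration can be put in numerical form. The sketch below is only an illustration: the seven caliper readings are hypothetical, and the standard deviation and range are used here as two common ways of expressing "the extent of such variations."

```python
from statistics import mean, stdev

# Hypothetical repeated caliper readings of one steel ball, in millimetres.
readings_mm = [25.402, 25.398, 25.401, 25.399, 25.400, 25.403, 25.397]


def spread_summary(values):
    """Return (mean, sample standard deviation, range) of repeated readings."""
    return mean(values), stdev(values), max(values) - min(values)


avg, sd, rng = spread_summary(readings_mm)
print(f"mean = {avg:.4f} mm, sd = {sd:.4f} mm, range = {rng:.4f} mm")
```

The smaller the standard deviation and range, the more consistent the measuring situation.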

In educational and psychological measurement we have another way of 
expressing and gauging consistency. In measuring the qualities of human 
beings it is seldom possible or even appropriate to determine the consist- 
ency of measurement by many repeated measurements of the same thing, 
as was done in the illustration of the steel ball. Since we are dealing with 
living, changing organisms, we cannot expect repeated measurements to 
show nearly such close agreement. Therefore, when dealing with people, 
we determine consistency by measuring a number of individuals (only
twice as a rule) and comparing the relative standings of the individuals on
the two sets of measurements or scores. It should be noted that the two 
successive measurements are usually not more than a few days apart. 

To illustrate with a simple example, let us suppose we had given a group 
of seven children a test of clerical aptitude and ranked them according to 
their scores. A day or two later we repeated the test on the same group of 
seven children and ranked them again. The results might be as follows: 


Table VI

Comparison of Scores Made by Seven Children on the Same Test Administered Twice

            First Testing        Second Testing
Pupil       Score    Rank        Score    Rank


The degree of consistency of measurement can be judged here by the ex- 
tent to which the pupils tend to hold the same relative positions in their 
group. We can see that this tendency is high in this case since all pupils 
except F and G hold the same rank in both applications of the test, and 
even those two pupils shift only slightly. 

It should be pointed out that in this example all children show a gain in 
score between the first and the second testing, but their relative standings or 
ranks change in only two cases. If all individuals made the same score 
both times, or made lower scores the second time, the test would still show 
a high degree of consistency provided that the ranks of the individuals did 
not change. This, then, is what we mean by consistency. Conversely, a 
lack of consistency is shown by a situation in which individuals do not hold 
the same or similar relative positions in a group when measured twice with 
the same test. 
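
A short sketch may make the idea of consistency of rank concrete. The scores below are hypothetical (they are not the values of Table VI) but follow the pattern described: every pupil gains on the second testing, and only pupils F and G change rank. The Spearman rank-difference coefficient is used here as one convenient index of agreement between the two sets of ranks; the function names are our own.

```python
def ranks(scores):
    """Rank scores from highest (rank 1) to lowest; assumes no tied scores."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]


def rank_correlation(first, second):
    """Spearman rank-difference coefficient: 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    r1, r2 = ranks(first), ranks(second)
    n = len(first)
    d_squared = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores for pupils A through G on two testings.
first_testing = [48, 45, 41, 38, 34, 30, 29]
second_testing = [52, 50, 44, 43, 37, 32, 33]

print(ranks(first_testing))
print(ranks(second_testing))
print(round(rank_correlation(first_testing, second_testing), 3))

# If every pupil gained exactly the same amount, the ranks would not change.
uniform_gain = [s + 5 for s in first_testing]
print(rank_correlation(first_testing, uniform_gain))
```

A coefficient near 1 indicates that pupils hold nearly the same relative positions on both testings, whatever happens to the level of the scores themselves.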

In determining the reliability or consistency of measurement of a stand- 
ardized test the number of individuals tested would usually be much larger, 
probably several hundred in all, but the principle would be the same. 
While there are various methods of estimating the reliability of educational 
or psychological tests, those most commonly used are based upon two 
measurements (or what is considered an equivalent procedure) of the same 
individuals.

Although we generally speak of reliability with reference to tests, this
quality is equally important in other measuring and evaluating techniques. 
Rating scales, personality tests, interest inventories, or even questionnaires 
are of little value unless we can be sure that the results obtained by their 
use are reasonably reliable. 

Unreliability or inconsistency in a measuring instrument stems from two 
sources. These are, first, the situation in which it is used, including the 
physical and psychological state of the individuals tested, and second, the 
test itself. Such variable factors as conditions of testing, time limits, and
directions can be fairly closely controlled provided that the persons using 
the test are willing to study directions carefully and follow them exactly. 
Conditions of the individual, such as fatigue, motivation, illness, and similar 
temporary factors, though not always as serious as sometimes imagined, 
are harder to control. There is no doubt that these conditions tend to re- 
duce the reliability of measurement, but it is also a fact that most of our 
good standardized tests show remarkably high reliability in spite of the 
operation of such factors. 

The principal factors in the measuring instrument itself which may affect 
the reliability of a test are the quality of the individual questions or items 
and the length of the test. Concerning the individual items, there are many 
ways in which the quality of the questions can affect reliability. For 
example, a question may be ambiguous; that is, it may be subject to more 
than one interpretation or it may be so worded that its meaning is simply 
not clear. 

The avoidance of ambiguity in test questions contributes materially to 
the attainment of a high degree of reliability, though even the most skillful 
test makers cannot always avoid this fault. In preparing test questions one 
should guard against vagueness and eliminate questions which prove on 
tryout to be ambiguous. Practice and experience in making tests, and a 
thorough knowledge of the subject matter, are the best preventives against
ambiguity in test items. 

A second factor which is inherent in the test itself and which affects the 
consistency of measurement is the number of questions or the length of the 
test. Other things being equal, the reliability of a test increases with its
length; that is, the longer a test is, the more reliable it tends to be. If
we stop to consider the implications of this statement we see that it seems 
perfectly logical. The statement means simply that the more samples we 
take of a given area of knowledge, behavior, or material, the more reliable 
our appraisal of that knowledge, behavior, or material will be. A chemist 
would not think of basing his analysis of a carload of iron ore on a half dozen 
samples taken at one end, or at one or two levels as the car is unloaded. 
Instead, he systematically samples the iron ore at all locations and depths,
and then by a system of “coning and quartering,” reduces these many
samples to one of a few ounces which he can take into the laboratory for 
analysis. Here he samples once more, basing his final judgment not on the 
analysis of one sample, but on duplicate or even triplicate samples which 
are carefully checked against each other for agreement. 

If we represent the areas of knowledge, behavior, or material to be 
sampled or tested by circles, as in the diagram below, the effect of more ade- 
quate sampling (or more items or questions) is evident. 


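[Diagram: two circles, A and B, each representing the area to be sampled;
A is sampled by 10 items and B by 19.]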


In A we have a sampling of the area by 10 items; in B by 19. It should be
obvious that, other things being equal, the sampling in B will yield more 
consistent and reliable results than that obtained in A. 
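
One way to see why the larger sample is steadier is sketched below. It rests on a simplifying assumption that is ours, not part of the discussion above: each item is treated as an independent right-or-wrong observation with the same chance of being answered correctly (the value 0.7 is hypothetical). Under that assumption the standard error of the proportion-correct score shrinks as the number of items grows.

```python
import math


def standard_error(p_correct, n_items):
    """Standard error of a proportion-correct score over n independent items."""
    return math.sqrt(p_correct * (1 - p_correct) / n_items)


p = 0.7  # hypothetical chance of answering any one item correctly
for n in (10, 19, 100):
    print(n, round(standard_error(p, n), 3))
```

The drop from 10 to 19 items is sizable, while each further item adds less, which anticipates the point of diminishing returns.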

Eventually we reach a point of diminishing returns when sampling is so 
thorough and reliability so high that additional sampling or testing does 
not improve reliability enough to justify the extra time and effort required. 
There is a formula for determining the relationship between the increased
length of a test and its consequent increased reliability. This is known as
the Spearman-Brown Prophecy Formula and is very useful in showing how
much the length of a test must be increased to attain a desired reliability, or
conversely, how much the reliability of a test will be increased if its length
is doubled, tripled, etc. The formula is

$$r_{nn} = \frac{nr}{1 + (n - 1)r}$$

where $r_{nn}$ = estimated reliability when length of the test is increased
$n$ times, and $r$ = reliability of the test in question.

Thus, if a test of 100 items was found to have a reliability of .80 and one
wished to know what its reliability would be if the number of items were
doubled, then $n = 2$ and $r = .80$, and

$$r_{nn} = \frac{2(.80)}{1 + (2 - 1)(.80)} = \frac{1.60}{1.80} = .889.$$
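
The formula can be sketched in a few lines. The function names are our own; the second function simply solves the same formula for n, giving the factor by which a test would have to be lengthened to reach a desired reliability.

```python
def spearman_brown(r, n):
    """Estimated reliability when a test of reliability r is made n times as long."""
    return n * r / (1 + (n - 1) * r)


def length_factor_needed(r, target):
    """Factor n by which test length must grow to reach the target reliability."""
    return target * (1 - r) / (r * (1 - target))


# The worked example: a 100-item test with reliability .80, doubled in length.
print(round(spearman_brown(0.80, 2), 3))  # 0.889

# Diminishing returns: each further doubling or tripling adds less.
for n in (1, 2, 3, 4, 5):
    print(n, round(spearman_brown(0.80, n), 3))

# How much longer must the test be to reach a reliability of .90?
print(round(length_factor_needed(0.80, 0.90), 2))  # 2.25
```

On these figures the 100-item test would need about 225 items of comparable quality to reach a reliability of .90.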


Reliability of measuring instruments is usually determined by one of
three methods. All employ correlation techniques as discussed briefly in
Chapter 2. These three methods are self-correlation, correlation of equivalent
forms, and split-halves correlation.


Self-correlation 


The example cited earlier in this section (page 68), in which a test was 
given twice to the same children with a day or so between testings, illus- 
trates the self-testing or test-retest approach to the determination of reliabil- 
ity. By calculating the coefficient of correlation between the two sets of 
scores an estimate of the consistency of measurement called the reliability 
coefficient is obtained. This test-retest method has the disadvantage of re- 
peating exactly the same questions to the same individuals, a procedure 
which may operate differently in individual cases. Some pupils may re- 
member many items and look up the answers they did not know, while 
others may forget the whole thing at once. Also, in this method of deter-
mining reliability, “practice effect” is generally at a maximum. By “prac-
tice effect” we mean the improvement of scores which results when the
same pupils take the same test a second time. However, when this effect is 
uniform for all persons taking the test a second time, it does not influence 
reliability. 
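
The following sketch illustrates both points with hypothetical scores: the reliability coefficient is the product-moment correlation between the two testings, and a practice effect that is the same for everyone leaves that coefficient untouched. The helper `pearson_r` is written out here so the example stands alone.

```python
import math


def pearson_r(x, y):
    """Product-moment coefficient of correlation between two sets of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical first-testing scores for seven pupils.
first = [48, 45, 41, 38, 34, 30, 29]

# Retest with a uniform practice effect: every pupil gains 4 points.
uniform_retest = [s + 4 for s in first]

# Retest with an uneven practice effect: gains differ from pupil to pupil.
uneven_retest = [52, 50, 44, 43, 37, 32, 33]

r_uniform = pearson_r(first, uniform_retest)
r_uneven = pearson_r(first, uneven_retest)
print(round(r_uniform, 3), round(r_uneven, 3))
```

Only the uneven gains, which disturb the relative standing of the pupils, pull the coefficient below 1.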

Self-correlation is a generally-used and accepted method of estimating 
reliability, and it is very valuable in situations where no other approach is 
possible. Whenever the self-correlation method is used, it is particularly 
important to use a test of adequate length and to provide an interval of 
several days to a week between successive testings. 


Correlation of Equivalent Forms 

If two or more equivalent forms of a test are available, reliability is usu- 
ally measured by giving both forms to the same individuals and then cal- 
culating the coefficient of correlation from the two sets of scores. The two 
forms may be administered at one sitting or two, depending upon the time
required, age and maturity of the person tested, nature of the test materials, 
etc. In most cases it is customary to allow only a day or so to intervene 
between testings. For reasons which are not always pertinent to the ques- 
tion of consistency or reliability, longer intervals tend to reduce the corre- 
lation and thus give lower results than should properly be obtained. 

By employing two forms of a test we are virtually using two equal halves 
of the same instrument. Equivalent forms are the same in degree and 
range of difficulty; they cover the same areas of knowledge, skills, etc., even 
though they use different items or questions. They are as near alike in 
every respect as it is possible to make them. Equivalent forms eliminate or 
reduce to a minimum the practice effect which is present when the same test
is given a second time.


Split-halves Correlation

It is often impossible to employ either of the methods described above to 
determine reliability of a test. It may not be feasible to test twice, there 
may not be equivalent forms available, or there may be other valid reasons 
for not using either the self-correlation or the equivalent forms correlation 
methods. In such cases, we generally use what is known as the split-halves 
technique. In this procedure, the test whose reliability we wish to measure 
is given in the ordinary manner, the papers are scored as usual, and then 
two scores for each individual are obtained by scoring alternate halves of 
the test separately. Such scoring can be done in several ways. Probably 
the most commonly used method of obtaining two scores for each person is 
to base one score on only the odd-numbered items of the test and the other 
on only the even-numbered items. Thus in a test of 100 items one score for 
any individual would be based on the 50 items numbered 1, 3, 5, 7, . . . 99, 
and one on the 50 items numbered 2, 4, 6, 8,...100. Therefore if a pupil 
missed 10 of the 50 odd-numbered items and 12 of the 50 even-numbered 
items, his two scores would be 40 and 38. His total score would, of course, 
be 78, the sum of 40 and 38, or 100 — (10 + 12). 
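
The odd-even scoring just described can be sketched as follows. The answer record is hypothetical: a pupil who misses 10 odd-numbered and 12 even-numbered items on a 100-item test, with 1 standing for a right answer and 0 for a wrong one.

```python
def odd_even_scores(responses):
    """Split a list of item results (item 1 first) into odd and even half scores."""
    odd = sum(responses[0::2])   # items 1, 3, 5, ...
    even = sum(responses[1::2])  # items 2, 4, 6, ...
    return odd, even


# Hypothetical paper: all 100 items right to begin with.
responses = [1] * 100
for item in range(1, 20, 2):   # miss items 1, 3, ..., 19 (10 odd items)
    responses[item - 1] = 0
for item in range(2, 25, 2):   # miss items 2, 4, ..., 24 (12 even items)
    responses[item - 1] = 0

odd_score, even_score = odd_even_scores(responses)
print(odd_score, even_score, odd_score + even_score)  # 40 38 78
```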

Sometimes the test is split by selecting pairs of items which are thought 
to be equivalent and allocating one of each pair to each half of the test. 
Again, the test may be divided in the middle and the scores based on the 
first and second halves, respectively. Finally, tests may be split by allocating 
groups of items to alternate halves. Each procedure has advantages and 
disadvantages, and the choice of which one to follow must be determined 
on the basis of which procedure will give the most nearly equivalent halves 
of the test. 

Having obtained two scores for each person tested, we can then calculate 
the coefficient of correlation between the two sets of scores. This is, in 
effect, the correlation between the two equivalent halves of the test adminis-
tered at one sitting. If the two halves are truly equivalent and if the test is
a reliable one, the correlation thus achieved is likely to be fairly high. One 
step more is necessary to enable us to determine the reliability of the entire 
test, in this example a test consisting of 100 items. The correlation coeffi-
cient is based on scores which represent only halves of the test, that is,
scores on only 50 items. Since we wish to know the reliability of the 100- 
item test, we now apply the Spearman-Brown Prophecy Formula, already 
discussed on page 70, and calculate the estimated correlation for 100 items, 
that is, for a test twice as long as our 50-item odd-even halves. In this 
manner it is also possible to calculate the estimated reliability of a test 
three times as long (150 items), or one of any other desired length. 
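
The whole split-halves procedure can be sketched in a few lines. The half scores for the eight pupils below are hypothetical; the two helper functions restate the product-moment correlation and the Spearman-Brown formula so that the example stands alone.

```python
import math


def pearson_r(x, y):
    """Product-moment coefficient of correlation between two sets of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r, n):
    """Estimated reliability when a test of reliability r is made n times as long."""
    return n * r / (1 + (n - 1) * r)


# Hypothetical scores of eight pupils on the 50 odd and 50 even items.
odd_scores = [40, 44, 35, 47, 31, 38, 42, 29]
even_scores = [38, 45, 33, 46, 34, 36, 43, 27]

r_half = pearson_r(odd_scores, even_scores)  # reliability of a 50-item half
r_full = spearman_brown(r_half, 2)           # estimate for the 100-item test
r_150 = spearman_brown(r_half, 3)            # estimate for a 150-item test

print(round(r_half, 3), round(r_full, 3), round(r_150, 3))
```

Note that n is counted in units of the 50-item half, so the full test corresponds to n = 2 and a 150-item test to n = 3.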

The split-halves method of determining reliability usually gives results
which are higher than those obtained by either of the other methods. For
this reason, reliability coefficients thus determined are generally regarded 
as approaching the maximum value which could be reached under ideal 
conditions. If this fact is kept in mind, the method may be used in obtain- 
ing estimates of reliability in a wide variety of situations, often where no 
other method is possible. It should be noted that the method gives spuri- 
ously high reliability coefficients whenever test scores are dependent largely 
on speed; the split-halves method should not be used in such cases. 


Interpretation of Reliability Coefficients 


A very natural and important question concerns the desirable magnitude 
of reliability coefficients. Of course it is advantageous to have tests as reli- 
able as possible. As we have pointed out earlier, the degree of reliability 
is influenced by a number of factors, such as condition of the subjects being 
tested, length of test, nature of items, etc. Nevertheless, the best standard- 
ized tests of achievement quite consistently show reliability coefficients of 
.90 or higher. Standardized tests of intelligence commonly have reliabili- 
ties almost as good, generally .85 or higher. The reliability coefficients for 
instruments such as personality tests and interest inventories are usually 
lower, the average being most often in the .70’s and .80’s, although some have 
not attained a correlation as high as .70. 

Another consideration in interpreting reliability coefficients is the range 
or variability of the group. A test will have a substantially lower reliability 
when given to a group of children in a single grade than it will if administered 
to a group ranging over several grades. For practical purposes this means 
that the reliability of a test should be judged in terms of the circumstances 
under which its reliability was determined. A test with reliability suitable 
for a range of several grades or several years in chronological age will prob- 
ably be unsuitable for use in discriminating among children in the same 
grade or of the same age. For such purposes a test with proven reliability 
in the narrower range should be sought.
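
The effect of range on reliability can be estimated roughly. The sketch below uses a standard adjustment that is not given in the discussion above and depends on an assumption: that the errors of measurement are of the same size in both groups, so that only the spread of scores differs. The reliability of .90 and the two standard deviations are hypothetical.

```python
def reliability_in_new_group(r_old, sd_old, sd_new):
    """Estimate reliability in a group with a different spread of scores.

    Assumes the error variance, sd_old**2 * (1 - r_old), is the same in
    both groups; only the variability of the scores changes.
    """
    error_variance = sd_old ** 2 * (1 - r_old)
    return 1 - error_variance / sd_new ** 2


# Hypothetical: reliability .90 over several grades (standard deviation 15),
# applied to a single grade whose scores spread much less (standard deviation 8).
print(round(reliability_in_new_group(0.90, 15, 8), 2))
```

On these figures a test that is highly reliable over several grades would be considerably less reliable within one grade, which is why the group on which a reported coefficient is based should always be noted.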

Desirable reliabilities differ also according to purpose. Where a test is 
intended only for use in studying groups, a lower reliability coefficient 
(around .75) may be sufficient to make fairly accurate comparisons. Where 
individual differentiation is the goal, reliabilities of .95 or higher are very
desirable.


Validity

A second very important quality of any good measuring instrument is
validity — the degree to which a test measures what it purports to measure. 


The definition of validity in a testing situation may be elucidated by such
questions as these: What does this test actually measure? To what extent does it
measure this particular ability, quality, or trait? In what situation or under
what conditions does it have this degree of validity?

In the discussion of reliability it was stated that a test can be reliable 
without being valid but that the converse is not true. In other words, it is 
conceivable that a test can measure some quality with a high degree of con- 
sistency without measuring at all the quality it was actually intended to 
measure. For example, a test might be devised that would require sorting
of cards, as in dealing a pack of regular playing cards for a four-handed
game. The test scores might take into account both speed and accuracy
and have a high degree of reliability. Yet the test might have little or no 
validity for any specific purpose, such as ability to play cards or ability to 
work on delicate machinery. 

On the other hand, an unreliable test will never show a high degree of 
validity, for the validity of a test cannot exceed its reliability. An unreli-
able test cannot be expected to rank the same individuals twice in the same
or nearly the same order. Obviously, if no dependence can be placed on the
consistency of results obtained by use of a particular test, one can never 
arrive at any sound judgment of what the test actually measures, and the 
test therefore has little or no validity. 

The discussion of the relationship of validity and reliability leads to an- 
other important fact regarding validity, namely, that validity is specific to 
the purpose and situation for which a test is used. A test might be a highly 
valid measure of intelligence for third-grade children and decreasingly 
valid for this purpose with fifth-graders, ninth-graders, high school graduates, 
and college seniors. Again, a test of manual dexterity might be a highly 
valid measure of probable success in assembling parts of small electric 
motors, but decreasingly valid for predicting success in farming, selling 
automobiles, managing a printing establishment, or teaching higher mathe- 
matics. A test which measures the thinking ability of seventh-grade pupils 
might well be almost a pure memory test for older persons who have been 
out of school for a long time. The validity of a test, assuming it is reliable, 
is a measure of the extent to which it serves its intended purpose. A test 
may be highly valid for one purpose and almost wholly lacking in validity 
for another. In the same way that a thermometer is used to measure 
temperature only, and a barometer to measure atmospheric pressure only, 
each testing instrument provides valid measurement for specific purposes. 

There are several widely accepted methods of assuring or determining 
the validity of measuring instruments used in educational or psychological 
work. These methods are classified for the sake of convenience into several 
categories which we may call curricular validity, logical validity, and sta-
tistical or empirical validity. Each of these will be briefly discussed and
illustrated.


Curricular Validity


When a teacher gives a test which deals with the material and to some 
extent with the objectives of instruction in a particular class, his test is said 
to have curricular validity. Let us suppose that he has been teaching the 
rules for writing formulas of chemical compounds and has illustrated those 
rules with a variety of actual formulas showing how each one is written in 
accordance with the system of chemical symbols, the valence of the ele- 
ments, etc. After perhaps a week of such instruction he makes a test whose 
questions are based on the same rules and on the same or similar formulas 
that he has been teaching. Under such circumstances the curricular 
validity of this test is taken for granted. 

If achievement tests are based on what has been taught they may be
assumed to have curricular validity. Often tests will go beyond the knowl- 
edge, skills and other immediate goals of instruction and attempt to measure 
more remote or ultimate goals, such as behavior in a situation where the 
student must apply what he has learned. The validity of such tests is im- 
proved to the extent that this attempt to measure applied knowledge or 
some other remote or ultimate quality is successful. 

In constructing achievement tests and batteries it is common practice 
to give careful consideration to analyses of textbooks, courses of study, and 
other instructional materials to insure that the tests will have curricular 
validity. The degree of such validity is proportional to the extent to which
the tests measure the goals of instruction that are common to courses of 
study and textbooks in the subject. 

The efforts to construct tests for measuring ability to understand, inter- 
pret, and apply — rather than ability to memorize facts — emphasize the 
problem of verbalism, concerning which much has been said and written in 
recent years. A test which measures only the ability to parrot what has 
been memorized has less curricular validity than one which measures under- 
standing, ability to interpret and to apply. 

Students who criticize an examination on the grounds that it measures 
only capacity for memorizing facts may have a legitimate complaint. On
the other hand, it may be argued that a student cannot understand or apply 
what he has not learned. If the ultimate objectives of instruction go beyond 
memory for facts, then our tests, to be valid in a curricular sense, must also 
go further in what they attempt to measure. If and when they do, the gain 
for education will be in two directions at once. The tests will be better be- 
cause they will have greater curricular validity, and instruction will be 
improved because teachers will tend to go beyond mere verbalism to stress 
broader and more functional outcomes. 

Logical Validity

Some writers on the subject classify curricular validity as a type of logical 
validity, but in our discussion we shall attach quite different meanings to 
the two terms. We have used curricular validity to designate that type of 
validity which is based upon the goals of instruction, whereas psychological 
or logical validity we shall think of here as the kind of validity which is 
based upon such procedures as job analysis, or introspective analysis. 
Logical validity, which applies less often in the validation of school achieve- 
ment tests, is most frequently of concern in the validation of intelligence 
and aptitude tests. 

The well-known Seashore Test of Musical Aptitude will provide a good il-
lustration of our meaning of logical validity. This test consists of a series of
records presenting tasks dealing with pitch discrimination, time intervals, 
rhythm patterns and other basic musical abilities. These are not generally 
the subject matter of instruction in music classes, yet the author by logical 
analysis has resolved the ability to perform well, musically speaking, into a 
few fundamental aptitudes and has incorporated in his test objective meth- 
ods of measuring these aptitudes. The validity of this test may be thought 
of in two ways: first, as a measure of ability in pitch discrimination, etc., and 
second, as a measure for predicting success in a musical field or career. Va- 
lidity is logical in both aspects of the test, since both depend on the degree 
of success attained in logical analysis of musical talent. Similar examples 
can be drawn from such fields as mechanical aptitude and clerical aptitude. 
In each case the test is based on an analysis of those abilities, qualities, or
traits that enter into successful performance. The logical validity of the 
tests is proportional to the degree that the analysis is accurate and com- 
plete, and to the degree that the tests reproduce and measure the actual 
skills involved. 

In the case of intelligence tests, logical validity is attained to the extent
that the tests contain tasks which actually require intelligence to perform 
successfully. The nature of these tasks is usually arrived at by introspective 
logical analysis. Alfred Binet, the originator of our present-day approach 
to the measurement of intelligence, was the first person to arrive at a logical 
analysis of what an intelligent act involves, psychologically speaking. He 
came to the conclusion that “to judge well, to comprehend well, to reason 
well, these are the essentials of intelligence.” Binet and his co-worker, 
Théodore Simon, proceeded to devise tests which seemed to them to 
measure these abilities. Their efforts were so successful that the basic ap- 
proach they used has scarcely been improved upon to the present time. 

To sum up what we have said about logical validity, it may be stated 
that, as here interpreted, logical validity is pertinent wherever curricular 
validity cannot readily be established. Particularly in the measurement of
interest, aptitudes, and intelligence, logical analysis is often the only basis 
upon which a test can be constructed. Such a test may be further validated 
by empirical methods, as will be shown presently, but in the beginning at 
least, it may be necessary to rely upon logical analysis for evidence of 
validity. 


Empirical Validity 

This type of validity generally concerns some criterion of success outside 
the testing situation. In the case of an achievement test in arithmetic, for 
example, one type of empirical evidence of validity would be a comparison 
of marks given in arithmetic with scores made by the pupils who took the 
test. It would be possible to work out a coefficient of correlation between 
the teacher’s marks and the actual test scores as a measure of the validity of 
the test. However, it is necessary to recognize the limiting factors in such 
a comparison. These are, first, the reliability of both the test and the 
teacher’s marks (criterion) and, second, the validity of the teacher's marks 
(criterion). If the teacher's marks are perfectly reliable and valid measures 
of achievement in arithmetic, and the test scores are perfectly reliable, the 
resulting correlation between marks and test scores will be an accurate 
measure of the validity of the latter. The less reliable and valid the 
teacher’s marks and the less reliable the test, the lower will be the maximum 
validity which the test can show empirically. Because of these limiting 
factors, validity coefficients are generally not as high as reliability coeffi- 
cients. In the example given, a correlation of .50 between the teacher's 
marks and the test scores in arithmetic would be typical, even though the 
test might have a reliability of .90, and the teacher's marks a reliability of 
.70 and a validity (when compared with a true measure of achievement in
arithmetic) of .40. 
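
The limiting effect of unreliability just described can be illustrated with a short computation. The sketch below, written in Python, uses invented marks and scores for eight pupils (not data from any actual class) to work out a coefficient of correlation. It then applies the familiar rule from classical test theory that an observed validity coefficient cannot exceed the square root of the product of the reliabilities of the test and the criterion. The function names and the reliability figures supplied to them are assumptions chosen only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def max_observable_validity(test_reliability, criterion_reliability):
    """Upper bound on an observed validity coefficient (classical test theory)."""
    return math.sqrt(test_reliability * criterion_reliability)

# Hypothetical data: teacher's marks (the criterion) and test scores.
teacher_marks = [70, 85, 60, 90, 75, 80, 65, 95]
test_scores = [32, 41, 35, 44, 30, 43, 28, 40]

validity = pearson_r(teacher_marks, test_scores)
ceiling = max_observable_validity(0.90, 0.60)  # assumed reliabilities

print(f"observed validity coefficient: {validity:.2f}")
print(f"ceiling set by the two reliabilities: {ceiling:.2f}")
```

Lowering either reliability figure lowers the ceiling, which is one way of seeing why validity coefficients generally fall below reliability coefficients.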

Correlation coefficients between test scores and other criteria are generally 
accepted as empirical evidence of validity. With intelligence tests, correla-
tions between I.Q.’s and various measures of scholastic success are frequently
cited as evidence of validity of the tests. Similarly, correlations between 
test scores and ratings by teachers, between test scores and job success, 
and between test scores and results of other tests or measures judged to have 
validity — all are commonly used as empirical evidence of the degree of 
validity of tests. 

Another variation of the empirical approach to validity is the use of what 
are often called widely-spaced, or criterion, groups. Let us suppose that we
wish to have some evidence of the validity of a so-called adjustment or per-
sonality inventory. We might proceed, under this plan, to have a group of
persons rated according to individual adjustment by those who know mem-
bers of the group intimately. The result would probably be a spread or dis-
tribution of the individuals from one extreme (very well adjusted) to the 
other (very poorly adjusted). Next, we would take the highest 10 per cent 
and the lowest 10 per cent of this distribution and give those persons the test 
we are attempting to validate. We would then score the papers of the two 
groups and compare them as groups. If our test is valid as a measure of 
adjustment of individuals, we should expect the best 10 per cent to make a 
much better average score than the lowest 10 per cent. Furthermore, we 
should expect to find little or no overlapping between the two groups. 
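
The procedure may be sketched in a few lines of Python. The ratings and inventory scores below are invented for illustration, the function names are hypothetical, and it is assumed that a higher score on the inventory indicates better adjustment. The sketch selects the highest and lowest 10 per cent of a rated group, compares the average scores of the two groups, and counts how many members of the low group score as high as any member of the high group.

```python
def select_criterion_groups(ratings, fraction=0.10):
    """Return (top_ids, bottom_ids): the highest- and lowest-rated persons.

    ratings maps each person to an adjustment rating (higher = better adjusted).
    """
    ordered = sorted(ratings, key=ratings.get)  # lowest-rated first
    k = max(1, round(len(ordered) * fraction))
    return ordered[-k:], ordered[:k]

def compare_groups(scores, top_ids, bottom_ids):
    """Return the mean score of each group and the amount of overlap."""
    top = [scores[p] for p in top_ids]
    bottom = [scores[p] for p in bottom_ids]
    mean_top = sum(top) / len(top)
    mean_bottom = sum(bottom) / len(bottom)
    # Overlap: low-group members scoring at or above the lowest high-group score.
    overlap = sum(1 for s in bottom if s >= min(top))
    return mean_top, mean_bottom, overlap

# Hypothetical ratings for twenty persons (1 = poorest, 20 = best adjusted).
ratings = {f"person_{i:02d}": i for i in range(1, 21)}
top_ids, bottom_ids = select_criterion_groups(ratings)

# Hypothetical inventory scores; only the two extreme groups take the test.
scores = {"person_19": 46, "person_20": 52, "person_01": 18, "person_02": 25}
mean_top, mean_bottom, overlap = compare_groups(scores, top_ids, bottom_ids)
print(mean_top, mean_bottom, overlap)
```

A large difference between the two averages together with little or no overlap would count as evidence of validity; how large a difference is required remains a matter of judgment.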

The same procedure has often been used in validating tests of aptitude 
and even of interests. In the latter case, tests or items are validated on 
the basis of the degree to which they differentiate among persons engaged
in quite different occupations, and the extent to which patterns of specific 
likes and dislikes are associated with different occupational groups. For 
example, an item such as “being an actor” might reveal marked differences
in the percentage of likes and dislikes among different occupational groups. 
In this case, 75 per cent of a group of male school teachers might say they 
would like “being an actor,” 10 per cent might be indifferent, and 15 per
cent say they would not like it. By contrast, comparable proportions for a 
group of successful civil engineers might be 30 per cent, 20 per cent, and 
50 per cent. The item “being an actor” might then be judged a valid one 
for differentiating between the likes and dislikes of male teachers and civil 
engineers, while an item which showed no such clear differences would be
judged invalid for this particular purpose. A valid pattern of likes and 
dislikes among specific occupational groups might thus be built up. When 
the test is administered, a reversal of this procedure takes place, in that the 
individual's responses are analyzed to determine whether they fit the pat- 
tern of likes and dislikes established for any specific occupational group. 
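
The comparison of the two occupational groups on a single item may be set out as a small Python sketch. The percentages are those of the example above; the function names and the 20-point threshold used to call an item "differentiating" are assumptions made only for illustration and are not a standard taken from any published interest inventory.

```python
# Percentages responding to the item "being an actor" (from the example above).
teachers = {"like": 75, "indifferent": 10, "dislike": 15}
engineers = {"like": 30, "indifferent": 20, "dislike": 50}

def response_differences(group_a, group_b):
    """Difference in percentage points for each response category."""
    return {response: group_a[response] - group_b[response] for response in group_a}

def differentiates(group_a, group_b, min_gap=20):
    """True if any category differs by at least min_gap points (assumed threshold)."""
    gaps = response_differences(group_a, group_b).values()
    return any(abs(gap) >= min_gap for gap in gaps)

print(response_differences(teachers, engineers))
print(differentiates(teachers, engineers))
```

An item on which the two groups gave nearly identical percentages would fail this check and would be judged invalid for separating these particular occupations.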

There are other empirical procedures for determining validity of tests, 
but these will not be discussed here. What has been said about both relia-
bility and validity should give the average teacher, counselor, or adminis-
trator a sound basis for judging the worth of a published standardized test
with respect to these two all-important criteria. But we hope that the dis- 
cussion will also help the student establish an objective attitude and sound 
procedures in his own test-making — an attitude and procedures which will 
ultimately improve the reliability and validity of every informal or teacher-
made test he helps to construct.³

Aside from the two universally applicable criteria which have been dis- 
cussed, certain secondary attributes, important in all good tests in education 


3 For another treatment of validity and reliability see: Committees on Test Standards
of the American Educational Research Association and the National Council on Measure-
ments Used in Education, Technical Recommendations for Achievement Tests (Washing-
ton, D.C.: American Educational Research Association, 1955), pp. 15-32.

and psychology, have already been mentioned and will be briefly discussed
below. These characteristics are: objectivity, ease of administration, ease of 
scoring, and ease of interpretation. 


• Learning Exercises •


4. Define reliability as it applies to a measuring instrument. How would you de-
termine the reliability of a semester examination in ninth-grade civics? What would
you consider a satisfactory reliability coefficient in this case? 

5. What can you say about the relationship between reliability and validity?

6. Distinguish between curricular, logical, and empirical validity by giving an
example of each approach.

7. Why are validity coefficients consistently lower than reliability coefficients of
standardized tests?


Objectivity

A test is objective to the extent that competent persons agree on the 
scoring of answers. Put in another way, a test may be said to be objective 
to the extent that the opinion or judgment of the scorer is eliminated from 
the scoring process. Objectivity is usually attained by (1) stating the ques- 
tions specifically and precisely, (2) requiring specific, precise, short answers, 
and (3) scoring the test by use of a previously determined scoring key. 
This key may be printed, in which case most of the scoring can be done by 
clerical workers; or, if properly prepared answer sheets are used, the test 
can be scored by a machine in a matter of a few seconds. In extensive test- 
ing which involves thousands of cases, the test-scoring machine is a means 
of saving much time and money. 
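
Scoring by a previously determined key is mechanical enough to be expressed as a short program. The Python sketch below is offered only as an illustration; the key, the pupil's answers, and the function name are all hypothetical. Because the score is simply the number of responses that match the key, any two scorers following the same key must arrive at the same result.

```python
def score_with_key(answers, key):
    """Count the items on which the pupil's response matches the scoring key.

    Items that are omitted, or answered with anything other than the keyed
    response, earn no credit. The scorer exercises no judgment.
    """
    return sum(1 for item, correct in key.items() if answers.get(item) == correct)

# Hypothetical five-item multiple-choice test.
scoring_key = {1: "b", 2: "d", 3: "a", 4: "c", 5: "b"}
pupil_answers = {1: "b", 2: "c", 3: "a", 5: "b"}  # item 4 omitted

print(score_with_key(pupil_answers, scoring_key))
```

The same logic underlies the clerical and machine scoring mentioned above; only the means of comparing each response with the key differs.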

Objectivity is a matter of degree; few, if any, tests are either wholly ob- 
jective or wholly subjective. The conventional essay test, which consists 
of a few questions asking the student to “discuss,” “describe,” “give rea-
sons for,” etc., is relatively lacking in objectivity. Essay tests emphasize
such matters as judgment, opinion, and interpretation, both on the part of 
the student and the person who evaluates his answers. It is inevitable that 
different persons, all judged to be competent in the field or subject, will 
evaluate the same examination paper quite differently. Yet much can be 
done to make this type of examination more objective by careful phrasing 
of the questions, prior preparation of ideal answers, prior agreement among 
readers or judges on rules for evaluating answers, and by other such precau-
tions. The maximum change in the direction of objectivity is achieved by
the use of objective questions designed to eliminate ambiguity in the
answering of such items, and by the use of mechanical scoring with previ-
ously prepared scoring keys.

A questionable reliability is the most serious weakness of subjective tests,
since, as has been said, reliability is directly affected by the degree to which 
the judgment, biases, and emotions of the scorer enter into his evaluation 
of the answers. It is only fair to the individual whose test is being evaluated 
that these personal factors be kept to a minimum. 

Objective tests have often been severely criticized on the basis that they 
emphasize and measure only specific, unrelated bits of information instead 
of broader concepts, inter-relationships, understanding, and ability to inter- 
pret. Too often, the critics claim, such tests encourage only the memoriza- 
tion of miscellaneous facts. Although the evidence on this point is not 
voluminous and is often conflicting, at least one thing seems clear: no one 
has ever demonstrated that the broader aspects of learning, such as inter- 
pretation and application of knowledge, cannot be measured objectively. 
In fact, there is much evidence to support the claim that objective tests can
be and have been devised to measure some of these broader outcomes of 
education. If objective tests on the whole have been based on the narrower, 
more specific goals of instruction, it is largely because the makers of these 
tests have lacked the necessary skill, vision, and perhaps the motivation to 
base their questions on larger concepts. If a teacher sincerely believes in 
the importance of the broader goals he can do much to encourage their 
attainment by emphasizing them in his own tests. 

Objectivity is essential to published standardized tests if the tests are to 
achieve maximum effectiveness. It would be very difficult, perhaps useless, 
to attempt any comparative interpretations of test results not arrived 
at objectively. Comparisons with norms, comparisons between individuals, 
classes, schools, sexes, regions, or grades, would be largely meaningless 
if each user of the test scored it according to his own independent ideas of 
the quality of the responses. One of the chief values of the standardized 
test is that personal judgments and biases are largely eliminated in the 
scoring and in the interpretation of results.


Ease of Administration 


It seems reasonable to assume that the simpler a test is to administer, the
less is the probability of making mistakes which will affect the results. Con- 
trast, for example, two well-known group intelligence tests, the Army Alpha 
and the Otis Self-Administering. Both tests were at one time among the
most widely used tests of intelligence. 

The Army Alpha consists of eight parts or sub-tests, each one precisely 
timed. The time limits are all short, varying from 1½ minutes to 4 minutes.
Since each part must be timed exactly with a stop watch and all participants 
are supposed to begin and stop work on each part of the examination at the 
same time, the administration of the test places a great burden on the 
examiner, especially if he is not highly trained in administration of group
tests. The test lends itself to frequent errors in timing, misunderstanding 
of directions by examinees, and failure of individuals to begin and stop 
working at the right times. 

The Otis Self-Administering Test Series, published about the same time 
as Army Alpha, presents a notable improvement over Army Alpha in the 
matter of ease of administration. The Otis tests are not subdivided; all 
questions are together in a single set or part. The preliminary directions ex- 
plain thoroughly each type of item the examinee will encounter in the test, 
and provide opportunity for him to receive further explanations when 
needed. When all examinees are ready, the signal to begin is given, and 
work proceeds uninterrupted until the time allowed for the entire test has 
elapsed. These tests are properly called “self-administering.” Any intelli- 
gent, conscientious person can administer them correctly with just a little 
preliminary study of the directions. No additional equipment, not even 
a stop watch, is needed. 

In the more than twenty-five years since the Otis Self-Administering
Tests were first published many changes have been made in group tests, but 
it is doubtful that any subsequent tests are easier to administer. This 
criterion, while not of supreme importance, is nevertheless one that test 
authors should meet more adequately than they do. A little forethought 
and planning concerning the administration of the test will often eliminate 
difficulties which otherwise might plague thousands of users as long as the 
test is available. 

Once the test is standardized there is usually nothing that can be done to 
improve directions for administering it. To simplify the administration of 
a test would be to change it, and this would upset the norms and other 
standardization data and procedures. Therefore, it behooves the authors 
of tests to give careful thought to the nature and organization of the test, 
as these affect its administration, in order to eliminate all unnecessary com-
plications and difficulties. Likewise, the prospective user of the test should
carefully study each test under consideration, keeping this criterion in 
mind. Other things being equal, it is only common sense to choose the test
which is the simplest to administer. This will generally save time, and the 
results will probably be more accurate since there is less likelihood of mis- 
takes occurring in an easily-administered test than in a more complicated 
one, especially when the test is to be used by persons not highly trained and
skilled in test administration.


Ease of Scoring 


Much of what we have said regarding ease of administration applies with 
equal force to ease of scoring. In fact, the test which is simple and easy to 
administer may also be easy to score, though this is not necessarily so.
Sometimes administration and scoring are complicated because the test is 
designed to yield sub-scores or part-scores for diagnostic purposes. In such 
cases it is obvious that ease of administration and scoring must be sacrificed 
in order to achieve some other more important aim. 

Another factor which affects ease of scoring is objectivity. One can work 
out the directions for a test ever so carefully, organize it as well as possible, 
and yet achieve only a difficult and burdensome scoring system because of 
the lack of objectivity of the test itself. If the items do not permit specific, 
objective answers, scoring will be difficult. Of course where scoring is done 
mechanically, nothing but objective types of items can be considered for use. 
Yet, surprising as it seems, a number of widely-used standardized tests still 
include such relatively non-objective types of items as completions or fill- 
ins. As anyone who has used such tests knows, these items always make 
scoring difficult. 

The use of properly prepared stencils, scoring keys, etc., will also do 
much to simplify test scoring. Test publishers have developed various
schemes for simplifying the scoring of standardized tests. For example, the
Clapp-Young Self-Marking Tests⁴ were, so far as the writer knows, the pi-
oneers in the use of a double sheet. The pupil marks his answers on the
front and back of the closed double answer sheet or booklet. Parts of the 
inside surfaces of the double sheet are printed with a carbon ink strip facing 
printed squares or other symbols which designate the correct answers. 
Thus, when the student pencils his markings on the front and back of the 
closed double sheet the carbon strip inside carries the impression of the 
mark to the area of the printed symbols indicating the correct answers. 
Upon completion of the test, the scorer, when he has separated the double 
sheet, can quickly tally the correct answers by counting the number of 
carbon impressions which coincide with the printed ones. Similar systems 
are used by other test publishers.

Where the new type of answer sheets are used, publishers generally pro- 
vide with the test a perforated scoring stencil for hand scoring. This stencil 
fits over the answer sheet and has holes so placed that the pencil marks of 
correctly answered items are visible. Scoring can be done quite rapidly by 
fitting the proper stencil over the filled-out answer sheet and simply count- 
ing the number of marks that appear in the holes of the stencil. The same 
or similar stencils are also used for machine scoring. When scoring tests 
in this way it is necessary to check each answer sheet to see that no one has 
marked more than one space in answering any item. Teachers will often 
make their own scoring stencils which greatly facilitate the task of scoring 
examinations.


4 Published by Houghton Mifflin Company, Boston. 

Ease of scoring is an especially important consideration where large num-
bers of tests are to be used. Since most schools and communities still do
their test scoring by hand, the amount of time required to score each test is 
an important factor in the total cost of the testing program. The cost of a 
test which requires fifteen minutes to score is much greater than one which 
can be scored in five minutes, regardless of who does the scoring. 

There are a few other considerations that have a bearing on this point. 
The more complicated the scoring, the more chance there is for errors, and 
the more time must be spent in checking the work. Also, if scoring becomes 
too burdensome and time-consuming it may never be finished. Tests may 
even be enthusiastically purchased and administered, only to be placed on 
shelves somewhere, the scoring never completed. Certainly there may be 
other causes contributing to such a state of affairs, but lack of time and 
energy for a burdensome job of scoring must be reckoned with. 

Finally, the success of the scoring process depends to some extent on 
having simple, accurate, and clear directions for scoring. There is wide 
variation among manuals for standardized tests with respect to this point. 
Directions for scoring some tests are so clear that the scorer cannot go 
wrong; at the other extreme are directions which are a veritable puzzle, 
even for experienced testers. However, it must be recognized that the aim 
of test authors and publishers is to consistently make their directions as 
simple and intelligible as possible. It is not to be expected that this goal 
will be uniformly attained in all tests, mainly because of the variation in 
complexity and scope of different tests. 


Ease of Interpretation 


A test may meet all the criteria so far mentioned and yet present great 
difficulties when it comes to interpretation of the results. Ease of inter- 
pretation generally depends on two factors: first, the mechanics of interpre- 
tation, and second, the helps provided for giving meaning and significance 
to the scores. 

The first point largely concerns the transmutation of the raw score on the 
test to some derived score. This is generally done through tables of norms. 
The best that can be done by way of simplification in this process is to set 
up these tables in such a way that they can be easily and accurately read. 
It often helps to present norms both in the form of tables and of graphs. 
A graph, such as a percentile curve, may make the transition from raw 
scores to percentile equivalents easier, faster, and more accurate than a 
table can, especially in those cases which require interpolation in a table. 
Where profiles, either individual or group, are used, sample profiles should 
be given to assist inexperienced test users in constructing some for their
own cases.
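
The interpolation mentioned above can be made concrete with a brief Python sketch. The table of norms below is invented for illustration and does not come from any published test; the function name is likewise hypothetical. The sketch converts a raw score to a percentile equivalent by straight-line interpolation between the two nearest tabled values, which is in effect what a reader does when he reads a value from a percentile curve.

```python
def percentile_from_norms(raw_score, norms):
    """Convert a raw score to a percentile by linear interpolation.

    norms is a list of (raw_score, percentile) pairs sorted by raw score.
    Scores beyond the table are given the nearest tabled percentile.
    """
    if raw_score <= norms[0][0]:
        return norms[0][1]
    if raw_score >= norms[-1][0]:
        return norms[-1][1]
    for (low_raw, low_pct), (high_raw, high_pct) in zip(norms, norms[1:]):
        if low_raw <= raw_score <= high_raw:
            fraction = (raw_score - low_raw) / (high_raw - low_raw)
            return low_pct + fraction * (high_pct - low_pct)

# Hypothetical table of norms: raw score and its percentile equivalent.
norms_table = [(20, 10), (30, 25), (40, 50), (50, 75), (60, 90)]

print(percentile_from_norms(35, norms_table))  # falls midway between 25 and 50
```

A table that lists every raw score makes such interpolation unnecessary, but where only selected scores are tabled the arithmetic above, or a graph, must supply the intermediate values.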

It is also desirable for the sake of easy interpretation not to crowd too
much on one chart or table. In their desire to have everything in one handy 
place, test authors and publishers will sometimes crowd so much into a 
small space and use such small type that the ordinary user of the tests will 
become confused and lost in the maze of different types of scores, percen- 
tiles, mental ages, chronological ages, deciles, grade norms, and diagnostic 
devices presented on one page or chart. 

On the second point, it may be said that this is perhaps the most difficult 
problem in the entire process of using standardized tests. After the tests 
are given and scored, the inevitable question is, “What do the results 
mean?” It is a sad but true fact that too often the test experts themselves,
while proficient in their knowledge and advice about tests and how to ad- 
minister and score them, are less skillful in providing practical aids for in- 
terpreting and applying the results. This is not to be blamed wholly on the 
so-called experts. It is extremely difficult to establish general rules for in- 
terpreting and using test results, rules which will apply in all or even most 
situations. What may be right for one test may be entirely inappropriate 
for another. The test manual usually suggests a variety of ways in which 
the test scores may be interpreted and used, and from the manual accom- 
panying each test or group of tests the prospective user of any given test 
will find some suggestions that are appropriate and useful. This is the func- 
tion of the test manual. Some of our best tests now have manuals which in 
themselves are small textbooks on testing. These manuals are usually 
written clearly and without unnecessary technical language so as to appeal 
to teachers and other prospective users of the tests. Usually the manuals 
contain a profusion of aids for simplifying the use and interpretation of the 
tests, and many suggestions of ways in which the results can be used for 
better learning by, and counseling of, pupils. When considering the criterion 
of ease of interpretation in relation to a particular test the prospective user 
must examine the manual carefully to find evidences of how well the cri- 
terion is met. 


• Learning Exercises •


8. What is meant by objectivity? Subjectivity? Illustrate. How are measuring 
instruments made more objective? 
9. Why must standardized tests be as objective as possible? 
10. Some tests are very difficult and time-consuming to administer. What 
legitimate reasons can you think of for this? 
11. If a test-scoring machine is available in your vicinity, try to see the machine
in operation. What are the limitations on the use of such equipment? 
12. Examine a specimen set of a standardized test with respect to its objectivity, 
ease of administration, scoring, and interpreting. Is it satisfactory in every respect?
If not, in what ways could it be improved? 


We shall next discuss a group of characteristics which apply as a rule only 
to standardized tests. In fact, they apply so rarely to unstandardized, in- 
formal, classroom, and teacher-made tests that they may be considered as 
representing perhaps the chief differences between the two groups. These 
characteristics are adequate norms, equivalent forms, and economy. An infor- 
mal, locally-made test seldom has norms in any general sense; it rarely has 
two or more equivalent forms, and the question of cost is not usually a crucial 
factor. In the case of standardized tests, these factors are all of importance. 


Adequate Norms 


Every good standardized test has norms. In fact, one of the main pur- 
poses of the process of standardization is to establish norms. These may be 
of many types, depending upon the type of test and the uses for which it is 
intended. The common types of norms have been illustrated and discussed 
in Chapter 3, so they need not be repeated here. Adequate, usable norms 
are essential to a good standardized test. There was a time when stand-
ardized tests were published with inadequate norms, but these days no
reputable test publisher will put a test on the market without norms of 
some kind, and the better tests will have norms based on a large and reason- 
ably adequate sampling of a representative population. The prospective 
user of a test should satisfy himself that the norms are based on a population 
sample that is representative from the standpoints of geographic areas, rural 
and urban populations, grade level, sex, socio-economic status, and types 
of schools. Careful reading of the test manual, particularly those parts 
describing how the normative population was chosen, will generally reveal 
whether or not the norms are representative, dependable, and useful.

It might be well to emphasize here that because a test is printed does not 
mean that it is standardized, and that all tests which claim to be standard- 
ized do not necessarily have adequate or useful norms. Furthermore, a test 
with adequate norms for a certain group might not be usable with people 
to whom these norms do not apply. For example, a test which has been
standardized mostly on children in Utah would probably have limited use
with a comparable group in New York State.

We have already pointed out that the way the norms are presented and 
the adequacy of instructions for their use are important factors in the ease 
of interpretation of test results. It should be mentioned again, however, 
that the prospective user of a test can easily determine these qualities by 
careful examination and study of the test and the accompanying manual. 


Equivalent Forms


As a rule, a standardized test should have two or more equivalent forms. 
Again, this is one of the common differences between standardized and un- 
standardized tests. However, the fact that a test has two or more forms is 
not always to be taken as satisfactory evidence that the forms are really 
equivalent, for it will sometimes be found upon careful scrutiny of test and 
manual that the forms are not equivalent. Alternate forms may cover 
quite different areas of subject matter so that they are not at all comparable 
in this respect. They may also be unequal in difficulty — either in range of 
difficulty, in average difficulty, or both. Or they may be equivalent at some 
levels and not at others. For example, in a test of achievement for use in 
several successive grades the two forms may show equivalent difficulty at 
one grade level, but not in the next higher or lower grade. Such situations 
are probably the exception rather than the rule, but it is well for the test 
buyer to know that these possibilities exist and to keep them in mind when 
examining tests for possible use. By study of the test and the manual it 
should be possible to determine whether or not the different forms of the 
test are really equivalent. 
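
As a rough illustration of the kind of check described above, the following sketch compares two forms of a test grade by grade, looking at the difference in average score and in spread of scores. It is only a sketch under stated assumptions: the function name compare_forms and all of the scores are hypothetical, and a real comparison would rest on the data reported in the test manual.

```python
from statistics import mean, pstdev

def compare_forms(scores_a, scores_b):
    """Compare two forms given to comparable groups of pupils.

    Returns the difference in average score and in spread (population
    standard deviation); values near zero suggest similar difficulty.
    """
    return {
        "mean_difference": mean(scores_a) - mean(scores_b),
        "spread_difference": pstdev(scores_a) - pstdev(scores_b),
    }

# Hypothetical scores: the forms agree in Grade 4 but not in Grade 5.
grade_4 = compare_forms([20, 22, 24], [20, 22, 24])
grade_5 = compare_forms([30, 32, 34], [26, 28, 30])
```

In this invented example the two forms would look equivalent at Grade 4, while at Grade 5 Form A runs four points easier than Form B, which is the situation the paragraph above warns about.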


Economy 


The factor of economy, which we have already discussed at several points, 
is a real consideration, and we should emphasize here, first, that it must be 
reckoned in broad rather than narrow terms, and second, that — as far as 
possible — economy should be a determining factor in selecting tests only 
if all other criteria are equally well satisfied. 

In elaboration of the first point, we may point out that, as with motor 
cars, for example, the important consideration is not so much the initial 
cost as the upkeep. The price per copy of the test booklet and answer sheet 
may well turn out to be one of the minor items in the total cost of the testing, 
and it is a good idea to try to make accurate estimates of the total cost per 
pupil before embarking on any testing program. This estimate should in- 
clude, in addition to the cost of the test materials themselves, the expense 
of scoring the tests, analysis of the results, and follow-up. Sometimes it will 
be found that a test which costs less initially will cost more in the long run. 
The test may be made of cheaper materials and have to be replaced sooner; 
or it might not be set up for use with answer sheets in which case a test can 
be used only once; it may require more scoring time; or the test might even 
be found inadequate for its intended purpose, once the results are available. 

The cost per pupil of testing with a battery such as the Stanford Achieve- 
ment Test may be estimated at anywhere from fifteen to fifty cents per pupil, 
depending upon the initial cost of the test booklet, upon who does the scoring 
and how the cost of this work is figured. If the scoring is done by teachers 
and no allowance is made for the time thus spent, the cost will be near the 
minimum. If the scoring must be computed on an hourly basis the cost 
per pupil will be substantially higher. On the other hand, in the case of a 
test which is quickly and easily scored, this part of the cost will be a minor 
item in the total expense per pupil. Thus, costs must be calculated on the 
basis of more factors than just the actual price of the testing materials. 
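
The arithmetic behind such an estimate can be sketched briefly. The function below is a minimal illustration, not a costing method given in this text: the name cost_per_pupil, its parameters, and every figure used are assumptions chosen only to show how scoring time can outweigh the price of the materials.

```python
def cost_per_pupil(materials_cents, scoring_minutes, hourly_rate_cents, other_cents=0):
    """Estimate the total cost of testing one pupil, in cents.

    materials_cents: booklet and answer sheet, per pupil
    scoring_minutes: time needed to score one pupil's test
    hourly_rate_cents: what an hour of scoring time is charged at
    other_cents: analysis of results, follow-up, and the like
    """
    scoring_cents = scoring_minutes * hourly_rate_cents / 60
    return materials_cents + scoring_cents + other_cents

# Hypothetical figures: the same test, scored by teachers with no
# allowance for their time, and scored at an hourly rate.
unpaid = cost_per_pupil(materials_cents=15, scoring_minutes=10, hourly_rate_cents=0)
paid = cost_per_pupil(materials_cents=15, scoring_minutes=10, hourly_rate_cents=150)
```

With these invented figures the cost rises from 15 cents to 40 cents per pupil once scoring time is charged, although the price of the materials has not changed.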

The second point relating to the criterion of economy is that no test is a 
bargain cost-wise if it is inferior in other important respects. Only if two 
tests are equally good for the purpose at hand should cost be the deciding 
factor. The prospective purchaser must satisfy himself by all the informa- 
tion available that two (or more) tests will do a job equally well before he 
makes a cost comparison. This is not to say that one should disdain a test 
merely because it is cheap. In the days when tests were consumable, that 
is, when pupils marked their answers on the test itself rather than on sepa- 
rate answer sheets, some test publishers used cheaper, smaller print and a 
less expensive format without any evident harm to the usefulness of the 
test. If a test is not to be used over and over again with new separate 
answer sheets, a test in the cheaper format may answer the purpose just as 
well as a more expensive test. 

To summarize, the factor of economy in testing is a matter that goes far 
beyond the list price of the test in the publisher’s catalog. For most users 
of tests in small quantities a difference of a few dollars per year may not be 
of great importance. Those who use large quantities of standardized tests 
year after year, however, must calculate carefully, taking many factors into 
account. Where large numbers of tests are used small differences in cost per 
pupil will become very considerable in the aggregate and over long periods. 


13. Examine the tables of norms reproduced on pages 56, 57 and 58, with respect 
to the discussion of norms in this chapter. Do they seem adequate and easily usable? 
What evidence do you find in the manuals to support your answer? 

14. What are the advantages of equivalent forms? How would equivalent forms 
of a standardized test be made? 

15. Name several factors that affect the economy of a measurement program 
besides the actual cost of the tests or other instruments used. 
Objectives as the Basis of All Good Measurement 


IMPORTANCE OF DEFINING OBJECTIVES 


Teaching involves five essential processes; namely, defining goals or objectives, 
choosing content, deciding on methods of instruction, the instruction 
itself, and measuring results. The order of sequence is not absolutely fixed, 
particularly in regard to content and method, but measurement ordinarily 
comes last and definition of goals comes first — as it must if teaching is to 
have direction and purpose. To try to teach and evaluate without defining 
objectives is like starting out on a journey without knowing where to go. 
It may be pleasant to wander around for a while, but it is doubtful that any 
sort of progress can be made without some direction. 

The good teacher formulates his objectives, chooses methods and mate- 
rials in accordance with his objectives, employs these methods and ma- 
terials, and uses measurement to determine how well or to what degree the 
objectives have been attained. In a sense everything is determined by the 
objectives. If the objective is to teach how to multiply two-place numbers 
by two-place numbers, the methods and materials will necessarily differ 
from those employed for teaching long division or square root; they will 
differ even more from the methods and materials used in teaching the parts 
of speech, the names of the chemical elements, or the story of the writing of 
the United States Constitution. 

Objectives or goals may be stated in different ways, some of which will 
be discussed and illustrated later in this chapter. It may be that some 
teachers will not consciously formulate any objectives at all, but will simply 
teach “by the book.” Nevertheless, every teacher works toward some ob- 
jectives, even if it is only to get through the textbook by the end of the term. 
Whatever the objectives and no matter how they are formulated or thought 
of, they constitute an essential step or part of teaching. We do not mean 
to imply here that one way of thinking about or formulating goals is as good 
as another. We are saying only that every teacher necessarily has some 
goals or objectives which give direction and purpose to his work. It is 
highly important that these objectives be stated clearly and explicitly so 
that their meaning and implications are clear and well understood. 

What has been said about defining objectives for teaching applies with 
equal force to measurement and evaluation. In order to measure the re- 
sults and effectiveness of instruction it is essential to know what the teacher 
has been trying to accomplish. When objectives are poorly defined or per- 
haps not defined at all, it is impossible to do an effective job of evaluation. 

In the five-step description of the teaching process mentioned above, 
measurement is the fifth or final step. There are occasions when measure- 
ment may come earlier, as in the case of pre-testing, and sometimes measure- 
ment is followed by re-teaching as in the case of diagnostic testing, but as a 
rule measurement is the final step in the procedure for any given unit or 
period of instruction. It reveals the success of the teacher’s and the pupils’ 
efforts. Measurement is the only way to determine to what extent the ob- 
jectives of instruction have been attained. Unless there is systematic and 
effective appraisal, the extent of progress attained in the classroom must 
remain a matter of subjective opinion or conjecture. 

Certainly teachers’ opinions are valuable in determining the status and 
growth of pupils with respect to educational goals. These opinions constitute 
an important factor in the evaluation process. However, they are 
only one element in the total process and it is important to supplement them 
with systematic and more objective measures. The teacher should employ 
the widest possible variety of measuring and evaluative tools and tech- 
niques, so long as these devices and techniques are practical and appropriate 
for the given situation. The use of a wide range of measurement tools is 
essential, not only from the standpoint of making appraisal more reliable, 
but also because different objectives or goals require different techniques of 
appraisal. If we desire to know how well some of the facts about the early 
history of our country have been learned we use one kind of instrument, 
possibly an objective test; if we wish to know how well pupils can handle 
rulers, read thermometers, or weigh objects, another approach to measure- 
ment would be employed. Still other techniques would be needed to de- 
termine the extent to which some of the precepts of good citizenship carry 
over into actual out-of-school behavior. 

To summarize what has been presented so far, it might be said that objec- 
tives and measurement complement each other and are integral parts of a 
whole. Unless objectives are defined, we do not know what to try to 
measure, and unless we can measure, it is impossible to tell whether or not, 
and to what degree, objectives have been realized. 
ARRIVING AT A USEFUL STATEMENT OF OBJECTIVES 


Objectives or goals may be stated in various ways. For example, there 
are immediate objectives and ultimate objectives. Immediate objectives 
are often stated in terms of something specific to be learned, some skill, 
knowledge, or understanding to be mastered. Ultimate objectives, on the 
other hand, are more often stated in terms of some long-term goal and are 
likely to be focused more on the learner than on what is learned. They tend 
to emphasize the function of what is learned rather than the knowledge itself. 
In civics, for example, immediate objectives may be to learn about the or- 
ganization of government, the responsibilities and functions of its different 
branches, and the duties and responsibilities of citizens in a democracy. 
Ultimate goals in civics might be to establish a continuing interest in im- 
proving our government, and a willingness conscientiously and consist- 
ently to perform the duties of citizenship — such duties as examining and 
comparing political platforms, candidates, and issues, and exercising the 
right to vote. 

A large share of our teaching and measurement has concerned itself with 
the immediate goals rather than the ultimate ones — testing for recall of 
instructional materials rather than for the ability to apply the knowledge 
and skills learned. There are several reasons for this. One is that teaching 
and testing for immediate objectives is more practical. Of course it is every 
teacher’s hope and desire that what is taught today may be remembered 
and, even more important, carried over into action tomorrow and the next 
day, and a year or two or ten from now. Likewise, every teacher hopes that 
what is learned in the classroom will function on the playground, at home, 
and elsewhere. But it is almost never practical to measure the achievement 
of these ultimate objectives. 

Second, because of the nature of what we teach, testing for immediate 
objectives is easier than testing for ultimate ones. Much of our school 
learning and teaching comes from the printed word. Our immediate goal is 
to have the pupil learn and understand what he has read and do as well as 
possible what he has been taught. It is therefore simpler to measure his 
present comprehension than to measure his ability and disposition to apply 
his knowledge. It is much easier in the classroom or even in the shop or 
laboratory to measure John’s knowledge of the parts and structure of a 
gasoline engine than to measure his ability to repair such an engine; it is 
easier to measure his knowledge of traffic regulations than to measure his 
respect for and adherence to them. 

In the third place, most of our measurement in schools is limited to vari- 
ous types of paper-and-pencil tests for reasons of economy of time and effort. 
And since immediate learning is often “book learning,” our efforts have been 
largely directed toward the measurement of book knowledge, to the neglect 
of those types of learning concerned with action and actual performance, 
which often cannot be measured at all by paper-and-pencil devices. 

Yet we should never stop trying in our measurement and evaluation prac- 
tices to get at these ultimate goals, for they are the final measure of our 
teaching success. We have accomplished little in our schools if we produce 
only verbalization of knowledge. Such verbalization is comparatively easy 
to measure, but it demonstrates little more than that the learner can recite 
what he has been taught or has read. Formulations of objectives can be 
very useful to the evaluation process if they emphasize the functions of 
knowledge as well as knowledge itself, and if the objectives are expressed 
in terms of the learner’s performance and behavior rather than in terms of 
the facts he has learned. 

Knowledge of facts and principles is of unquestionable importance. In- 
deed, an eminent psychologist has often been quoted as saying that the best 
thinkers are those who have the most information to think with. A student 
cannot hope to apply facts or principles which he does not know. It is not a 
question of either factual knowledge or ultimate desirable action; both are 
important. The reason the latter has been stressed in this discussion is that 
measurement practices have tended to emphasize knowledge rather than 
behavior. When educational objectives are formulated, as much emphasis 
should be given to the ultimate behavioral goals as to the immediate types 
of outcomes. Such a balance would give a desirable and needed emphasis to 
teaching and measurement. 

What has been said so far suggests that formulating objectives for a 
field like science or social studies, or even for a single subject, is a rather 
intricate procedure. Because of the inevitable complexities involved, state- 
ments of objectives are usually the work of committees or groups which are 
formed on a local, state, or, more often, on a national basis. Such groups are 
chosen carefully to assure representation of different viewpoints and localities, 
and the resulting formulations generally represent the best, most forward-looking 
ideas that the group can produce at that time. On the other 
hand, such statements usually represent in some degree a compromise between 
various viewpoints, and may not, therefore, wholly satisfy either the 
very progressive or the very conservative members of the group. As a rule, 
however, the statements of objectives turn out to be acceptable to the 
majority. 

An example of a statement for science teachers is given on page 94.1 

1 Victor H. Noll, “The Objectives of Science Instruction,” Science Education in American 
Schools, Forty-Sixth Yearbook of the National Society for the Study of Education, 
Part 1 (Chicago: University of Chicago Press, 1941), pp. 28-29. 
OBJECTIVES OF SCIENCE INSTRUCTION 


A. Functional information or facts about matters, such as 
   1. Our universe: earth, moon, stars, weather, and climate. 

B. Functional concepts, such as 
   1. All life has evolved from simpler forms. 

C. Functional understanding of principles, such as 
   1. Energy can be changed from one form to another. 

D. Instrumental skills, such as ability to 
   1. Perform simple manipulatory activities with science equipment. 

E. Problem-solving skills, such as ability to 
   1. Sense a problem. 
   2. Define the problem. 

F. Attitudes, such as 
   1. Open-mindedness: willingness to consider new facts. 

G. Appreciations, such as 
   1. Appreciation of the contributions of scientists. 

H. Interests, such as 
   1. Interest in some phase of science as a recreational activity or hobby. 


This quotation gives only one or two examples under each category, and 
even the original statement of objectives is incomplete in that all the facts, 
concepts, skills, and interests pertaining to science could never be listed. 
In fields such as the language arts or the social studies the task of formulating 
educational objectives is perhaps even more complex. Nevertheless, 
it has been done by similar committees from time to time. An example of a 
typical statement of objectives for the social studies is given below.2 

2 Wisconsin Cooperative Education Planning Program, “Scope and Sequence of the 
Social Studies Program,” Bulletin No. 14 (Madison, Wis.: State Department of Public 
Instruction, November, 1947), pp. 6-7. 


OBJECTIVES OF SOCIAL STUDIES INSTRUCTION 


1. Understandings 

   a. Of the democratic faith and its meaning for human welfare and 
      happiness 

   b. Of the application of democratic faith in the development of the 
      American heritage 

   c. Of the forces which have made for world interdependence and the 
      need for world organization 

   d. Of the historical and geographic reasons for the behavior of regional 
      and national groups 

   e. Of the local community and its problems, and the need for wide 
      participation in community concerns by all citizens 

   f. Of the significance in social problems of the mental health and 
      emotional balance of individual human beings 


2. Attitudes 

   a. That all human beings regardless of race, national origin, color, or 
      any matter over which they may have no control are entitled to 
      equal rights to life, liberty, and the pursuit of happiness 

   b. That we concern ourselves with achieving and improving human 
      welfare and democratic liberties everywhere in the world 

   c. That all citizens should participate actively in working toward the 
      solution of community problems for social betterment 

   d. That reflective group thinking can serve as an approach toward the 
      solution of social problems 


3. Skills or abilities 

   a. The ability to take part in group discussion 

   b. The ability to take part in group planning 

   c. The ability to think reflectively on social problems 

   d. The ability to search out and use valid and adequate sources of 
      information 

   e. The ability to evaluate ideas and opinions on controversial problems 
      offered by and through radio, movies, newspapers, periodicals, books, 
      etc. 


How does the classroom teacher arrive at useful statements of objectives 
in his own work? It is obvious that few teachers will have the breadth of 
view or knowledge represented by committees, local or national. Yet every 
teacher has his own ideas as to what objectives are important and useful 
for his pupils. He should therefore study such statements as those given 
above and react to, modify, and adapt them to his own purposes. In the 
process he will learn and grow, and the statements thus will serve their ul- 
timate purpose which is the improvement of instruction. Boys and girls 
in the classroom will be the beneficiaries eventually. 

Many teachers find it useful to take specific portions or items from such 
statements and relate them to methods, content, and evaluation techniques. 
This practice may be illustrated as follows: 


Objective or Goal          Method-Content              Evaluation 


1. Ability to take part    Committee appointed         Check list, rating scale 
   in group discussion     to plan for a field trip 
                           to local city hall 


2. Understanding of        Study of table of           Paper and pencil test: 
   time concept in the     geological eras; field      identification of rocks 
   geological sense        trip to study rocks,        and fossils, pictures of 
                           fossils, etc.               animals and plants of 
                                                       prehistoric times 


The value of any statement of objectives is determined by the extent to 
which the statement is accepted and incorporated into the thinking and 
practice of teachers. It is the responsibility of the alert, professionally- 
minded teacher to read and ponder such formulations, to select those which 
will best apply in his own situation, and, as far as possible, to relate his 
teaching and measurement practices to the objectives chosen. 




It should be emphasized here that nothing that has been said above 
should be taken to mean that a classroom teacher should not attempt to 
formulate a statement of objectives for his own work. Indeed, this would 
be one of the most useful and thought-provoking activities in which he 
could engage. It would, in all probability, have the effect of vitalizing his 
teaching and would make him intelligently critical of his own procedures as 
a teacher. On the other hand, teachers should not be condemned if they 
adopt as their own objectives those which are expressed either explicitly or 
implicitly in a good textbook. Whatever the nature and source of his ob- 
jectives, it is important that the teacher think about them, adopt whichever 
objectives seem good to him (or possibly his local curriculum committee), 
and incorporate them into his teaching. 

No single test, examination, or procedure can measure all objectives, nor 
can one teacher do an adequate job of weighing all of the many possible 
objectives in a given field. He must choose from an adequate list of objec- 
tives those which he will attempt to measure at any given time and then 
formulate his teaching and measurement program on the basis of the objec- 
tives selected; another time he will decide upon another goal or set of goals 
to be measured, and his teaching and measuring procedures may then be 
quite different. By the process of constantly re-examining and re-appraising 
his objectives the teacher will broaden his outlook on his work and will de- 
velop breadth and skill in measuring a variety of outcomes. This will in- 
evitably result in a fairer, more adequate evaluation of the pupil’s status, 
growth, and progress. 


Learning Exercises 


1. Select some phase of a subject, such as: (a) fractions in fifth-grade arithmetic; 
(b) from geography, customs of a people like the Eskimos; (c) rules of punctuation 
in language arts; (d) the writing of simple formulas for chemical compounds; or 
some other phase of a subject with which you are thoroughly familiar. State some 
of the objectives you would consider important for the phase or area you select, and 
describe the kinds of measurement procedures you would employ to determine how 
well the objectives have been attained. 

2. Criticize constructively one of the sample statements of objectives given on 
pages 94 and 95 of this chapter. Does the statement seem to you to be (a) suffi- 
ciently inclusive; (b) detailed enough; (c) applicable at all levels from first grade 
through twelfth; (d) of practical value to a classroom teacher? 


HOW OBJECTIVES FUNCTION IN GOOD MEASUREMENT 


We have already emphasized the importance of objectives in measuring 
the results of instruction, and we have given some sample statements of educational 
objectives in subject-matter areas, and some suggestions as to how 
a teacher may adapt these to his own purposes or develop some objectives 
of his own. In the remainder of the chapter we shall see how objectives may 
actually function in the process of devising measuring instruments or tech- 
niques. 

Once we have selected the objective to be measured, our chief task is to 
decide on the method of measurement and then construct suitable tests or 
other instruments. Of course, the measurement of different kinds of goals 
presents problems of a widely varying nature. Immediate objectives, such 
as the ability to solve certain problems in addition, subtraction, multiplica- 
tion, and division, are easier to measure than ultimate ones, such as the 
ability to keep accurate financial accounts in the home. Indeed, little of 
our measurement in the schools gets at these more remote yet very im- 
portant outcomes. Again, it is easier to measure knowledge objectives 
than to measure attitudes or appreciations. It is possible to determine quite 
satisfactorily how well an individual has learned the principles of good 
sportsmanship, but it is quite a different matter to measure his disposition 
to follow these principles and adhere to them in athletic competition. 
However, the measurement of the more ultimate and intangible results of 
instruction should not be regarded as a hopeless task or one which teachers 
themselves should not undertake. Great progress has been made in the 
development of a wide variety of measuring and evaluative techniques. 
Teachers and others responsible for evaluation of educational products 
should keep these difficult-to-measure goals always in mind and continually 
experiment with ways of measuring them accurately. 

One of the steps in measurement which often presents difficulties is the 
relating of content or subject matter to educational goals. For example, 
just what is the purpose, from the pupil’s standpoint, of learning this or 
that or the other specific thing? In the case of subjects like homemaking or 
auto mechanics the answers to such questions are fairly clear. However, in 
the case of the more academic subjects like algebra or Latin the answers 
are not so obvious, though they are available in such statements of objec- 
tives as have been cited above. In making tests of certain educational goals 
it is essential to relate course content to objectives. One device which has 
been found helpful in doing this is a two-way chart. An example of such a 
chart for high school biology is shown in Figure 4. 

Figure 4 
Chart Showing Distribution and Relationships of Objectives and Content for a 100-Item Test in High School Biology 

OBJECTIVES (listed across the top of the chart) 

  I. To achieve understanding of (50%) 
     A. functional biological facts and concepts. 
     B. functional biological principles. 
  II. To achieve skill in (45%) 
     A. interpretation of graphs, charts, data, maps, tables, etc. 
     B. problem solving. 
     C. functional use of biological information in the appraisal of 
        real situations. 
  III. To acquire scientific attitudes of suspended judgment, open-mindedness, 
     and sensitivity to problems and to cause-effect relationships. (5%) 

CONTENT (outlined at the left of the chart) 

  Characteristics which all living organisms have in common: 
     1. Similarities in structure: cells, protoplasm. 
     2. Life functions. 
  Kinds of living things. 
  Nature of processes essential to the life of individual organisms: 
     1. Food manufacture and utilization. 
     2. Circulation and excretion. 
     3. Coordination and adjustment: nervous and endocrine. 
  Processes associated with continuance of the species: 
     1. Reproduction. 
     2. Heredity. 
  Life on the earth, past and present. 
  Interrelationships: 
     1. Ecological relations: adaptations. 
     2. Economically harmful organisms. 
     3. Parasitism and disease. 
     4. Beneficial organisms. 
     5. Conservation of our biological resources. 

[The numbers of items entered in the boxes of the chart are not legible in this copy.] 

Courtesy of Dr. Clarence H. Nelson, Board of Examiners, Michigan State University. 

It will be seen that the major areas of course content are outlined at the 
left, while the educational objectives in terms of pupil behavior are listed 
across the top. The boxes in the main body of the figure represent the points 
of intersection of these two aspects of the work being tested. The numbers 
in those boxes represent the test maker’s judgment as to the amount of 
emphasis that each area should receive in the total examination, in terms 
of the number or proportion of questions or items to be included in the test. 
Such an analysis of content and objectives can be made as detailed as the 
teacher prefers, and the total can be made to represent an analysis of 
greater or more limited scope. For example, a similar chart can be drawn 
to analyze test items on the geography of a state, a unit on health and sanitation 
in a community, a semester's work including several units, or an 
examination to be used as a final test at the end of the year. 
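
A first draft of such a two-way chart can be worked out mechanically before the test maker applies his own judgment. The sketch below is an illustration only and is not the procedure used for Figure 4: the function allocate_items, the content areas, and the percentage weights are all hypothetical. Each box simply receives items in proportion to the weight of its content area multiplied by the weight of its objective, with the few items lost in rounding given to the boxes having the largest fractional share.

```python
def allocate_items(content_pct, objective_pct, total_items):
    """Draft a two-way chart: number of items for each (content, objective) box.

    Weights are whole percentages that must each total 100. Integer
    arithmetic is used so that the counts always add up to total_items.
    """
    if sum(content_pct.values()) != 100 or sum(objective_pct.values()) != 100:
        raise ValueError("each set of weights must total 100 per cent")
    # Exact share of each box, expressed in ten-thousandths of an item.
    exact = {}
    for area, c in content_pct.items():
        for goal, o in objective_pct.items():
            exact[(area, goal)] = c * o * total_items
    counts = {box: share // 10000 for box, share in exact.items()}
    # Hand out the items lost to rounding, largest fractional share first.
    leftover = total_items - sum(counts.values())
    by_fraction = sorted(exact, key=lambda box: exact[box] % 10000, reverse=True)
    for box in by_fraction[:leftover]:
        counts[box] += 1
    return counts

# Hypothetical weights for a 100-item test.
CONTENT = {
    "Common characteristics of organisms": 20,
    "Life processes": 30,
    "Continuance of the species": 20,
    "Interrelationships": 30,
}
OBJECTIVES = {"Understanding": 50, "Skills": 45, "Attitudes": 5}
chart = allocate_items(CONTENT, OBJECTIVES, 100)
```

A proportional draft of this kind is only a starting point. The teacher would ordinarily shift items from box to box wherever a topic lends itself better to one objective than to another, as the numbers in a chart such as Figure 4 reflect judgment rather than arithmetic.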

Another illustration showing how test items may be keyed to objectives 
is to be found in the Iowa Tests of Basic Skills, Multi-Level Edition for 
Grades 3-9, Form 1. A page from the test of Work-Study Skills is repro- 
duced here. 


[Picture map of the center of a town, with key; not reproduced here.] 

The picture map on this page shows the center of a town. Each of the most 
important buildings is named. The signs for the streetcar line, the railway 
line, the bus line, and the bus stops are shown in the key below the map. 

12. Which building is north of the park, just across the street? 
    1) The school                3) The police station 
    2) The post office           4) The bank 

13. What street is two blocks south of Maple Street? 
    1) Birch Street              3) Oak Street 
    2) Station Street            4) Pine Street 

14. Which building is on a street corner? 
    1) The school                3) The fire station 
    2) The public library        4) The church 

15. The park covers how many blocks? 
    1) One                       3) Three 
    2) Two                       4) One cannot tell from the map. 

16. Which of these is closest to a bus stop? 
    1) The public library        3) The fire station 
    2) The bank                  4) The police station 

17. On which corner does the bus line cross the streetcar line? 
    1) Birch Street and Fourth Avenue 
    2) Elm Street and Third Avenue 
    3) Elm Street and Fourth Avenue 
    4) It does not cross the streetcar line on this map. 

18. Sally lives two doors from the public library. John lives south of the 
    park. Which has the shorter way to school? 
    1) Sally 
    2) John 
    3) Neither. They live the same distance from school. 
    4) One cannot tell for sure. 

19. Which street is the longest? 
    1) Birch Street 
    2) Elm Street 
    3) Maple Street 
    4) One cannot tell from the map. 


3 E. F. Lindquist and A. N. Hieronymus, Iowa Tests of Basic Skills (Boston: Houghton 
Mifflin Company, 1955). 
Reproduced below is a page from the Teacher’s Manual corresponding to 
the map-reading section of the Work-Study Skills test. Notice the state- 
ments of the various skills involved in map reading, and the table showing 
the relationship of each test item to a specific skill-objective. 


TEST W-1: MAP READING

Skills Involved

1. Ability to orient map and determine direction
   a) To determine direction from orientation
   b) To determine direction from parallels or meridians
   c) To determine direction of river flow or slope of land
2. Ability to locate places on maps and globes
   a) Through the use of standard map symbols
   b) Through the use of a key
   c) Through the use of distance and/or direction
   d) Through the use of latitude or longitude
3. Ability to determine distances
   a) Determining distance on a road map
   b) Determining distance by using a scale of miles
   c) Determining distance on a globe
   d) Comparing distances
4. Ability to determine or trace routes of travel
5. Ability to visualize landscape features
6. Ability to infer man’s activities or way of living
   a) From physical detail
   b) Ability to recognize differences in seasons and hours of daylight in
      different latitudes
   c) Ability to determine differences in time zones
7. Ability to read and interpret facts from pattern maps
   a) To read and compare facts from a single pattern map
   b) To read and compare facts from two or more pattern maps
   c) To visualize landscape features
   d) To infer man's way of living


TEST W-1: MAP READING

[Table: Item No., with the skill classification (for example 2a, 3b, or 4) of
that item in Form 1 and in Form 2]


Altogether, the test includes 89 map-reading items for Grades 3-9. 
Items 12-19, as shown on the preceding page, are for Grade 5. After the
test has been taken and scored, the items answered correctly and those missed 
can be checked against the analysis of skills and the key. Thus, a teacher 
is able to determine which map-reading skills have been attained and which 
have not. This can be done for individual pupils as well as for the class 
as a whole. 
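
The checking of items against skills described above can be sketched in a few lines of Python. This is an illustration only: the assignment of items to skills in ITEM_SKILL and the two pupils' results are hypothetical (they are not the entries of the manual's table), and the function names are introduced here for the example.

```python
from collections import defaultdict

# Hypothetical key: item number -> skill code from the list of skills
# (for example "2a" = locating places through standard map symbols).
ITEM_SKILL = {12: "1a", 13: "3d", 14: "2a", 15: "3d",
              16: "2b", 17: "4", 18: "3d", 19: "3d"}


def skill_summary(results):
    """Tally items right and items given for each skill.

    `results` maps item number -> True (answered correctly) or False (missed).
    Returns {skill: (number_right, number_of_items)}.
    """
    right = defaultdict(int)
    total = defaultdict(int)
    for item, correct in results.items():
        skill = ITEM_SKILL[item]
        total[skill] += 1
        if correct:
            right[skill] += 1
    return {skill: (right[skill], total[skill]) for skill in total}


def class_summary(all_results):
    """Combine the tallies of several pupils into one tally for the class."""
    right = defaultdict(int)
    total = defaultdict(int)
    for results in all_results:
        for skill, (r, t) in skill_summary(results).items():
            right[skill] += r
            total[skill] += t
    return {skill: (right[skill], total[skill]) for skill in total}


# Hypothetical results for two pupils on items 12-19.
pupil_a = {12: True, 13: False, 14: True, 15: False,
           16: True, 17: True, 18: False, 19: True}
pupil_b = {12: True, 13: True, 14: False, 15: True,
           16: True, 17: False, 18: True, 19: True}

print(skill_summary(pupil_a))
print(class_summary([pupil_a, pupil_b]))
```

A low count of items right on one skill, for a pupil or for the class, suggests where further instruction may be needed.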

Again, a teacher may wish to construct a test or some other device for 
measuring the results of instruction on a single objective such as problem-
solving ability. This task likewise involves a breakdown of the objective
into behavior elements which can be observed or tested and recorded as 
evidence of the learner’s progress toward the desired goal. 

Several examples are given below to show how test items can be related 
to the ultimate, more remote objectives in various areas of common learning. 


IN THE AREA OF STUDY SKILLS ⁴


Objective: Ability to differentiate between fact and opinion. 

Directions: In the list below, some of the sentences are statements of fact, and 
others are statements of opinion. Indicate to which class you think each 
statement belongs by placing the proper letter in the space provided for it. 
Do not try to decide if each statement is true or false, but only whether it
should be classified as a statement of fact or of opinion. 


F — Fact 
O — Opinion 


(O) 55. The Democratic party has done more for this country than the
Republican party has. 

(F) 56. In 1939 there were two World's Fairs held in the United States. 

(F) 57. Alaska is northwest of Oregon. 

(F) 58. Scientific research often results in the production of new prod- 
ucts. 

(O) 59. No war has ever accomplished any good for the world. 

(O) 60. A high tariff increases the prosperity of the country. 

(O) 61. Only his defeat at the Battle of Waterloo prevented Napoleon
from making himself master of Europe. 


Objective: Understanding of use of common references. 

Directions: The degree to which a social-studies library is useful to students is 
determined partly by the ability of students to obtain needed information. 
Below are two lists. One contains those books which could compose a Social 
Studies Reference Shelf. The other contains a list of questions which you 
might wish to have answered. Do not try to answer the questions. Indicate
whether you could find the answers by placing beside the number of the
question the letter of the reference work in which you would be likely to find
the answer most satisfactorily.

Example: (F) 0. How many students are enrolled in American colleges and
universities? The answer F refers to The World Almanac, a handbook of
current information.


Reference Shelf 


A. Dictionary of American History
B. An atlas
C. A civics text
D. An economics text
E. Who's Who in America
F. The World Almanac
G. Reader's Guide to Periodical Literature
H. Official state government handbook
I. Dictionary of American Biography

4 Horace T. Morse and George H. McCune, Selected Items for the Testing of Study Skills,
Revised Edition, National Council for the Social Studies, Bulletin No. 15 (Washington,
D.C.: National Education Association, 1949). By permission of the publisher.

Questions


(B) 110. How does North America compare in size with Africa? 

(H) 111. Who is the chief justice of your state supreme court? 

(F) 112. How many persons were killed by autos last year? 

(A) 113. When was the Cumberland Road built? 

(H) 114. Who is the official custodian of state laws? 

(G) 115. What was the political significance of the last Congressional 
election? 


IN THE LANGUAGE ARTS ⁵


Objective: To organize and express thought logically, clearly, and effectively 
in sentences or in larger units. 

The logical organization of thoughts into clear and effective sentences and 
paragraphs includes a clearly defined sentence sense based upon an understand- 
ing of word meanings and of the use of words as expressive and connective 
devices. Use tests such as the following: 


(1) Are these groups of words sentences? Draw a line under Yes or No. 


(a) On the way to school Yes No 
(b) He has lost his pencil. Yes No 
(c) Please be quiet. Yes No 
(d) Baseball in the park. Yes No 


(2) In each exercise below part of a sentence has been cut off by a period. 
Write each sentence correctly. 


The pup grabbed the bone. And ran out into the yard. 
With a cross growl. Dixie started after him. 


(3) Put a cross through every and that is not needed in the story below. Put 
in capital letters and periods where they are needed to make good sentences. 


I have a big white rabbit and his name is Bumpo and every morning
he sits up and begs for a handful of clover and one day I went to feed 
him and he was gone and I found that he had dug out of his pen. 


(4) Place an X before the one sentence in each group of three that represents 
the best sentence structure. 


____ A. I counted six paintings walking down the stairs.
____ B. While walking down the stairs six paintings were counted.
____ C. While I was walking down the stairs I counted six paintings.

____ A. He dropped the bundle he was carrying to his mother in the
        mud.
____ B. The bundle fell into the mud which he was carrying to his
        mother.
____ C. Into the mud fell the bundle which he was carrying to his
        mother.

5 Harry A. Greene and William S. Gray, “The Measurement of Understanding in the
Language Arts,” Chapter IX in The Measurement of Understanding, Forty-Fifth Year-
book of the National Society for the Study of Education, Part I (Chicago: University of
Chicago Press, 1946). By permission of the publisher.


Objective: To clarify meaning by following correct language usage. 

Correct usage of pronoun and verb forms and subject-predicate relationships, 
and the avoidance of miscellaneous errors, such as double negatives, vague ante- 
cedents and redundancy, is a matter of correct habit closely related to under- 
standing in expression. Following are examples of appropriate test items: 


(1) Rewrite the following sentences selecting the correct pronoun from those 
in the parentheses. 


(a) (He, Him) and (I, me) were born in the same town. 
(b) Mother sent (we, us) girls to the store. 
(2) Rewrite the following sentences using the correct verb forms from those 
in the parentheses. 
(a) (Was, Were) you at the show last night? 
(b) It (doesn’t, don’t) seem so cold now. 
(3) Faulty expressions appear in these sentences. Rewrite the sentences cor-
recting all mistakes.


(a) Where did you see him at? 
(b) We haven't hardly no time left. 


IN ELEMENTARY MATHEMATICS ⁶

Objective: Interpretation of Data Presented Graphically and in Tables.


1. According to the chart, which food changed most in price? 


                        PRICE OF FOODS
YEAR          Eggs     Bread    Milk     Roast
Last year     $0.48    $0.11    $0.13    $0.39
This year      0.51     0.11     0.11     0.43

a) eggs b) bread c) milk d) roast


2. If the prices in the chart above are considered a fair sample of the cost of 
living, how does the cost this year compare to the cost last year? 


a) the same b) less c) more d) cannot tell 
3. In social studies we have made a line along which we will arrange dates in
history. The line begins with the year 1700 and ends at 2000. What letter
on the line is at the year 1812?

a) A b) B c) C d) D

[Time line running from 1700 to 2000, with points marked A, B, C, and D]

6 Ben A. Sueltz, Holmes Boynton, and Irene Sauble, “The Measurement of Under-
standing in Elementary School Mathematics,” Chapter VII in The Measurement of Un-
derstanding, Forty-Fifth Yearbook of the National Society for the Study of Education,
Part I (Chicago: University of Chicago Press, 1946). By permission of the publisher.
4. Study the graph of Alberta’s weight. During what year did she gain
most in weight?

(a) 10 to 11
(b) 11 to 12
(c) 12 to 13
(d) 13 to 14

5. How old was Alberta when she weighed approximately 100 pounds?

a) 11 years b) nearly 12 c) exactly 12 d) a little more than 12

[Graph: ALBERTA'S WEIGHT. Vertical axis: weight, 80 to 110 pounds. Horizontal
axis: Age, 10 to 14.]


The examples given in the preceding pages illustrate the design of paper- 
and-pencil tests which go beyond the measurement of mere knowledge and 
get at understandings such as ability to interpret and ability to apply 
knowledge. In most subjects, particularly academic ones, measurement is 
largely confined to paper-and-pencil devices. It is possible, however, to use 
other kinds of tests in some areas like shop, business, homemaking, and 
agriculture. Here, actual performance can be observed and rated. For 
example, when a girl is given a recipe for a cake, access to the necessary 
ingredients and equipment, and is permitted to prepare the cake according 
to directions, it is possible to observe her behavior and appraise it by means 
of a check-list, and to judge the product by the use of some kind of a rating 
device. A series of rating scales for different products of food preparation 
has been worked out by Clara M. Brown.⁷ A sample scale for cake will
illustrate the nature of these rating devices. 


Cake (Angel or Sponge)

                      1                               2    3                              Score
Appearance         1. Sunken or very rounded top           Flat or slightly rounded top   ____
                   2. Sugary surface or deep crevices      Slightly rough surface like
                                                           macaroons                      ____
Color              3. Dark brown or pale                   Even, delicate brown           ____
Moisture Content   4. Dry or insufficiently baked          Slightly moist                 ____
Texture            5. Coarse                               Small holes, uniformly
                                                           distributed                    ____
Lightness          6. Heavy                                Very light                     ____
Tenderness         7. Tough                                Very tender                    ____
Taste and Flavor   8. Flat, too sweet, eggy, or too        Pleasing, delicate flavor      ____
                      highly flavored

7 Clara M. Brown, Food Score Cards (Minneapolis, Minn.: University of Minnesota
Press, 1940).
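
A score card of this kind is easily recorded and totaled. The following Python sketch is an illustration only. It assumes that each of the eight points is rated 1, 2, or 3 and that the ratings are simply added; the text does not state how the scores on Brown's cards are combined, and the sample ratings are invented.

```python
# The eight points of the cake score card, in the order listed on the card.
POINTS = [
    "Appearance: shape of top",
    "Appearance: surface",
    "Color",
    "Moisture content",
    "Texture",
    "Lightness",
    "Tenderness",
    "Taste and flavor",
]


def total_score(ratings):
    """Add the ratings on one score card.

    `ratings` is a list of eight integers, each 1 (poor), 2, or 3 (good).
    Simple addition is an assumption made for this sketch.
    """
    if len(ratings) != len(POINTS):
        raise ValueError("one rating is needed for each point")
    for rating in ratings:
        if rating not in (1, 2, 3):
            raise ValueError("each rating must be 1, 2, or 3")
    return sum(ratings)


def weak_points(ratings):
    """Return the points rated 1, those most in need of attention."""
    return [point for point, rating in zip(POINTS, ratings) if rating == 1]


# Invented ratings for one pupil's cake.
sample = [3, 2, 3, 1, 2, 3, 3, 2]
print(total_score(sample), "out of", 3 * len(POINTS))
print(weak_points(sample))
```

Listing the points rated lowest, as weak_points does, keeps the rating tied to specific qualities of the product rather than to a single total.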


Similarly, in industrial arts performance and product are easily judged 
directly when conditions are fairly well controlled. Again, in a typewriting 
class the teacher can place his pupils in a situation which closely resembles 
an on-the-job task, such as transcribing a letter, so that he can observe their 
performance and evaluate the result. 

In these areas of teaching it is possible to measure behavior in situations 
which closely resemble actual working conditions, though it is not easy to 
do so in the more academic subjects like science or mathematics or history. 
In such areas the ultimate and more remote goals such as scientific atti- 
tudes, problem-solving skills, and good citizenship can often be measured 
only in a verbalized form. It is generally possible to determine how a pupil 
says he would behave under a given set of circumstances, but it is nearly 
always difficult — if not impossible — to subject the student to realistic 
circumstances for measurement purposes. Yet, as the illustrations that 
we have given demonstrate, considerable progress has been made and more 
will certainly come. Teachers and others responsible for measurement and 
evaluation in the schools and wherever measurement is carried on should 
keep always before them the ideal of making their practices in this area 
functional; that is, measurement should be concerned with appraisal of 
behavior, as far as possible, rather than with verbalization. 

At the same time, teachers and prospective teachers should use or experi- 
ment with all current methods of measurement, though, as was pointed out 
in the beginning of this chapter, paper-and-pencil tests will continue to be 
the mainstay of the measurement program. Moreover, behavior should not 
be conceived of in too narrow a sense. When a child solves an arithmetic 
problem, reads a story with understanding, or interprets a map or a chart, 
he is exhibiting kinds of behavior that can be adequately measured by paper- 
and-pencil tests. The teacher’s goal should be to make his tests as adequate 
as possible, and to supplement them whenever possible with a wide variety 
of other types of measurement and evaluation. The broader and more com-
prehensive the approach, the better the chances of encompassing in the
measurement program all of the important objectives of instruction. 


• Learning Exercise •


3. Construct a two-way chart like that shown for biology on page 98 of this 
chapter. (The class may be divided into two groups for this assignment, one to work 
on an elementary, and one on a secondary, subject or field. Or the class may be 
divided into smaller groups according to major interest, such as mathematics, sci- 
ence, English, etc.) Present the results for evaluation by the instructor and the 
rest of the class. 


Annotated Bibliography 


1. Bloom, Benjamin S., et al. Taxonomy of Educational Objectives, Preliminary
Edition. New York: Longmans, Green and Company, 1954. A discussion of the 
problem of classifying educational objectives in a systematic way. Somewhat ad- 
vanced for the beginner in test construction, but contains many useful ideas. 


2. Dressel, Paul L., and Mayhew, Lewis B. General Education: Explorations in
Evaluation. Washington, D.C.: American Council on Education, 1954. 302 pp.
The report of a cooperative study by nineteen colleges and universities of objectives
and methods of evaluation in the areas of social science, communications, science,
and the humanities, with emphasis on critical thinking and attitudes. A thorough 
and comprehensive exploration of goals in general education in higher education. 


3. Greene, Harry A., Jorgensen, Albert N., and Gerberich, J. Raymond. Meas- 
urement and Evaluation in the Elementary School, Second Edition. New York: Long- 
mans, Green and Company, 1953. 

—— Measurement and Evaluation in the Secondary School, Second Edition. 
New York: Longmans, Green and Company, 1954. 

The latter half of each book consists of separate chapters on measurement in the
commonly taught subjects in the elementary and the secondary school, respectively. 
Each chapter presents a statement and discussion of objectives in a subject or sub- 
ject field. 


4. Jordan, A. M. Measurement in Education. New York: McGraw-Hill Book
Company, Inc., 1953. 533 pp. Chapters 5 through 13 deal with measurement in
subject-matter fields such as language, mathematics, social sciences, etc. Each
chapter includes a statement and discussion of educational goals in that subject- 
matter field. 


5. Lindquist, E. F. (ed.). Educational Measurement. Washington, D.C.:
American Council on Education, 1951. Chapter 5: “Preliminary Considerations
in Objective Test Construction,” by E. F. Lindquist. Chapter 6: “Planning the
Objective Test,” by K. W. Vaughn. Both chapters, though somewhat advanced
in concept for ordinary measurement activities in the classroom, present many
good ideas and useful suggestions on objectives and construction of tests.
6. The Measurement of Understanding. Forty-Fifth Yearbook of the National
Society for the Study of Education, Part I. Chicago: University of Chicago Press,
1946. 338 pp. Emphasizes the importance of trying to teach and test for under-
standing. The main body of the report consists of twelve chapters, each dealing
with a major area of instruction and presenting a discussion and examples of
measurement of objectives in that area. A wealth of ideas for the person concerned
with improvement of objective measures of achievement.


7. Morse, Horace T., and McCune, George H. Selected Items for the Testing of
Study Skills, Revised Edition. National Council for the Social Studies, Bulletin
No. 15. Washington, D.C.: National Education Association, 1949. 81 pp. The 
first part of this bulletin presents a discussion of the problems involved in formulat- 
ing objectives and devising tests for study skills. The major portion is given to 
selected items for these purposes. 


8. Remmers, H. H., and Gage, N. L. Educational Measurement and Evaluation, 
Revised Edition. New York: Harper and Brothers, 1955. Chapter 2. An ex- 
cellent discussion of educational objectives with suggestions on how to formulate 
them, and illustrations of objectives of various types for different levels. 


9. Ross, C. C., and Stanley, Julian C. Measurement in Today's Schools, Third 
Edition. New York: Prentice-Hall, Inc., 1954. Chapter 5. A brief, practical 
discussion of principles of test construction, including planning, preparing, trying- 
out and evaluating the test. 


10. Thomas, R. Murray. Judging Student Progress. New York: Longmans, 
Green and Company, 1954. Chapter 2. A good presentation of the problem of 
defining educational goals and devising methods of evaluating them. Written 
primarily with the elementary teacher in mind, though high school teachers can 
profit from the chapter. 


11. Thorndike, Robert L., and Hagen, Elizabeth. Measurement and Evaluation
in Psychology and Education. New York: John Wiley and Sons, Inc., 1955. Chap- 
ter 3. In the chapter dealing with teacher-made tests there is a brief discussion of 
the function of objectives in developing such instruments. 


12. Wrightstone, J. Wayne; Justman, Joseph; and Robbins, Irving. Evaluation
in Modern Education. New York: American Book Company, 1956. Chapter 2. 
Steps in the process of evaluation, including formulation, definition, and clarifica- 
tion of objectives, and selection of tests for each objective, are briefly discussed. 
STANDARDIZED VERSUS TEACHER-MADE TESTS


The majority of schools in the United States today make use of standard- 
ized tests of one kind or another. Most tests of intelligence, aptitude, per- 
sonality, and interests are standardized tests, made by specialists for a test 
publisher, and sold by the publisher throughout the country. Few schools or 
school systems, except in the very large city organizations, attempt to de- 
velop such tests for their own use. 

The situation with respect to achievement tests is somewhat different. 
There are, of course, many standardized achievement tests on the market, 
and literally millions of them are used every year. These include tests in 
the separate subjects or branches as well as the achievement batteries. 
However, teachers usually feel that these tests do not adequately measure 
their own or the local objectives of instruction. While standardized tests 
are very useful in some ways, they are not usually the principal method of 
measuring achievement. In general, the classroom teacher himself is relied 
upon for the formulation of achievement tests. It is important, therefore, 
that the teacher’s professional training include some instruction on effective 
ways of planning, constructing, and evaluating various measuring instru- 
ments. 

Clearly, no standardized test of achievement can serve the needs and 
purposes of every local situation. The nature of the requirements for a 
standardized test is such that the test must be largely confined to the ele- 
ments of instruction which are common in a large number of schools. There- 
fore, such a test cannot, if it is to be maximally useful, include all those ele- 
ments which are peculiar to any one or even a limited number of schools. 
The most desirable and probably the most common practice is to use both 
standardized and teacher-made measuring instruments in most situations. 
Both serve useful though somewhat different purposes, and both are im-
portant parts of a well-rounded measurement program.


• Learning Exercise •


1. Examine a standardized test of achievement in a subject of your choice, pref- 
erably in your major course. What objectives does it seem to measure adequately? 
What objectives do you think you would measure by devices of your own? 


Since teacher-made tests play an important part in the evaluation prac- 
tices of schools, it is well to give some attention to accepted principles of 
planning, constructing, using, and evaluating such instruments. These are 
the four main stages in the process of testing. The stages of planning and 
constructing will be considered in this chapter, and the use and evaluation 
of teacher-made tests will be examined in the following chapter. 

In developing a standardized test, as, for example, one for first-year al- 
gebra, the planning is generally quite extensive and detailed. Textbooks, 
courses of study, and committee reports are analyzed for common objectives 
and content; the tests are carefully planned as to length, administration, 
and scoring. When a committee of teachers plans a test for local use some 
of these same steps will be carried out, though in a less elaborate and 
formal way. Likewise, a teacher constructing a test or measuring device 
for his own use does not usually need to go through such a formal procedure. 
In the first place, he has a certain degree of choice regarding objectives and 
methods. Then, too, he knows pretty clearly what he wants the test to 
cover or measure, when he wants to use it, and how much time he can 
give to it. Finally, he knows what he has been teaching and what testing 
he has already done so that the proposed test can be fitted into the situation 
he is familiar with. In other words, much of what constitutes planning in 
the case of a standardized test is taken care of more or less incidentally and 
automatically when a teacher devises tests for his own use. 

Some aspects of planning have already been discussed in Chapter 5. 
There, examples were given of test questions designed to measure different 
kinds of instructional objectives. Also, those examples illustrated different 
types of test questions. In the rest of this chapter additional consideration 
will be given the planning of tests and the problems and principles of con- 
structing good questions or items for locally made tests. 


• Learning Exercise •


2. List some of the decisions a teacher makes in planning a test for his own use. 
Illustrate each with a specific choice, one which you might make for a test in your 
field of specialization. For example, you might decide that your test would be
designed to measure achievement in the knowledge and understanding of our solar 
system, that you would allow 30 minutes for it, etc. 


BASIC QUALIFICATIONS OF THE TEST-MAKER 


To make good achievement tests requires three somewhat different abili- 
ties. In the first place, one must have an adequate knowledge of subject 
matter. It is not possible to construct a good examination without adequate
knowledge of the field, whether this be reading, civics, driver-training, or 
some other specialty. The person who attempts to construct examinations 
without such a knowledge foundation quickly reveals his deficiencies both 
to his associates and, sooner or later, to his pupils. 

A second requirement for making good examination questions is some 
degree of knowledge and skill in the techniques of test construction. This, 
contrary to what some students suppose, is not something that “comes 
naturally.” The techniques of making acceptable test items have been
developed by the experience of test experts and teachers over a period of 
many years, and much of value has thus been learned. Even in the prepa- 
ration of the essay examination, considerable thought and effort have been 
expended in finding ways to eliminate some of the shortcomings without 
sacrificing the good qualities. 

The third requisite for the successful test-maker is a knack of putting 
ideas accurately, concisely, and clearly into words. The ability to apply 
subject-matter knowledge and test-construction skill to the formulation of 
questions which will be unambiguous and brief, and which will measure 
accurately what the maker of the instrument intends, is almost an art. The
ability to formulate accurate, concise, and clear test items can be developed
to some extent by most teachers who possess the first two qualifications,
and who have a good command of the language and a desire to learn.

In the kind of courses for which this book is intended, little or nothing 
can be done about the first and the third requirements. Adequate knowl- 
edge of subject matter and ability to put ideas into good, clear English 
are not the objectives of a course in measurement. However, it is the ob- 
jective of this course and this book to provide some understanding of the 
basic principles of making good classroom tests and, as far as possible, to 
give some practice in doing this.


• Learning Exercise •


3. Rate yourself on the following scale by placing a check mark on each line indi-
cating where you think you stand with respect to each of the characteristics listed.

(a) Command of my subject or major in comparison with my classmates
    Superior | Better than average | Average | Distinctly below average | Weak

(b) Experience in making test questions
    A great deal | Substantial | A fair amount | Little | None

(c) Command of English
    One of my strongest assets | Rather good | Just average | Not too strong | Poor

Rank yourself on the three characteristics listed above. Best ____ Next best ____
Most lacking ____


PRELIMINARY STEPS 


As we have emphasized in Chapter 5, the first step in planning a test or 
measuring instrument is to decide what goals or objectives to measure. 
Having defined his objectives, the teacher then decides what type of test 
will best accomplish his purposes. Perhaps he decides an essay test will 
be best, or he may decide on the objective type. If the latter, he must 
determine what kinds of items to use, i.e., true-false, multiple-choice, match-
ing, short-answer, or possibly some modifications and probably some com-
binations of these. These decisions are usually influenced by what types of
items the teacher has had most success with, and the nature of the content, 
processes, or skills to be measured. Often it is best not to make firm deci- 
sions on such matters in advance, but to construct the kind of item that 
seems most suited to the particular objectives and content as one progresses 
with the construction of the test. 

In practice, it is customary to begin by a canvass of the instructional 
materials and activities in the light of the educational objectives to be
measured by the test. These materials and activities include reading as- 
signments, problems, experiments performed, films shown, discussions, field 
trips, etc. As the teacher reviews these, their relationship to important 
outcomes will become apparent and will probably suggest approaches to 
evaluation. Assuming that the test-maker decides to use objective tech- 
niques, he may find that one phase of the work lends itself to one type of 
question, whereas another purpose may be best served by a different type 
of question. In such cases the test-maker should generally follow his natural
inclination and not attempt to make all of his true-false items at once, then
all his multiple-choice items, etc. The nature of the knowledge, skill, or out-
come to be tested usually suggests and may even determine the kind of item
which is most appropriate. On the other hand, it sometimes seems possible 
to accomplish one’s purpose equally well with either of two different types 
of items. In such cases, other considerations will determine which type to 
use. 

As the questions are devised, it is recommended that each be written on 
a 3" × 5" card. When all the questions have been made the cards may be
sorted and arranged in any way or on any basis desired, such as by type of 
item, content, length, estimated difficulty, source of the item, etc. As in- 
formation is obtained on the effectiveness of items through tryout, such data 
can be entered on the cards, as well as dates of construction and use, and 
cross-references. Cards provide a high degree of flexibility which is of great 
value. They are easily filed and grouped. Also, single cards can be easily 
eliminated without disturbing the rest of the file if any of the items prove 
ineffective. As the teacher builds up his file of test items, he will be able to 
select different samples of questions for various tests and other purposes. 
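
The card file described above has a simple modern counterpart. The Python sketch below is an illustration only: the class ItemCard, its field names, the helper functions, and the three sample items on the solar system are all hypothetical and are introduced here for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ItemCard:
    """One test item, as it might be written on a single card."""
    text: str
    item_type: str      # e.g. "true-false", "multiple-choice"
    content: str        # topic or unit the item covers
    difficulty: float   # estimated proportion of pupils answering correctly
    source: str = ""
    dates_used: list = field(default_factory=list)


def sort_cards(cards, key):
    """Arrange the file on any basis desired, e.g. key="item_type"."""
    return sorted(cards, key=lambda card: getattr(card, key))


def discard(cards, text):
    """Remove one ineffective item without disturbing the rest of the file."""
    return [card for card in cards if card.text != text]


def draw_sample(cards, n, seed=None):
    """Select a sample of n items for a particular test."""
    return random.Random(seed).sample(cards, n)


# Hypothetical file of three items.
card_file = [
    ItemCard("The sun is a star.", "true-false", "solar system", 0.90),
    ItemCard("Which planet is nearest the sun?", "multiple-choice",
             "solar system", 0.75),
    ItemCard("Name the largest planet.", "short-answer", "solar system", 0.60),
]

for card in sort_cards(card_file, "difficulty"):
    print(card.difficulty, card.text)
```

Tryout data and dates of use can be added to a card as they accumulate, for example by appending to dates_used, just as notes would be entered on the card itself.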


TYPES OF TEST QUESTIONS 


In this discussion a distinction will be made between the more subjective 
types of question such as the essay and short-answer types, and the more 
objective types such as true-false, multiple-choice, and matching. This is
actually a distinction of degree rather than of kind. That is, objectivity is a 
continuous and variable quality; test questions are neither wholly objective 
nor wholly subjective. For example, the short-answer item is thought of as 
a variation of the essay question, but is considered somewhat more objective 
than the “discuss,” “explain,” or “analyze” forms. 

Usually the objectivity of an examination question is judged by the com- 
plexity of the pupil's answer and the resulting degree of difficulty in scoring. 
If the scoring requires judgment and evaluation of the response on the part 
of the scorer, the question or item is said to be subjective; to the extent that 
judgment and evaluation are reduced or eliminated from the scoring 
process, the item is objective. Most standardized tests except those for 
young children can be scored by clerical workers using a scoring key, or by a 
test-scoring machine. The process in either case is a mechanical one. Ques- 
tions of the essay or short-answer variety, on the other hand, cannot be so 
scored. However, questions of this type can be more or less objective, as 
will be shown later; they are not all equally or wholly subjective. 

The nature of the scoring and the judgment it requires depend also on the 
nature of the response the pupil is required to make. If the response consists 
of a figure, a plus mark or zero, a letter, or a black mark between two printed 
lines, the scoring can be done mechanically. When the examinee is asked to
write out his answer in words, draw a figure, make an outline or something 
of this nature, the scoring becomes more complex and subjective. 

From another standpoint, the more objective type of question is sometimes 
classified as a recognition type in that all of the information necessary 
to answer the question is supplied; the short-answer and essay types, on 
the other hand, are classified as a recall kind of question, since the examinee 
must himself supply the answer. This distinction, though not entirely clear- 
cut, may be helpful to the student in his thinking about the matter of objec- 
tivity. 

With these few preliminary observations in mind, we may proceed to a 
consideration of the various commonly used kinds of examination questions, 
beginning with the more subjective and going on to the more objective 
types. 


• Learning Exercises • 


4. Classifying test questions as objective or subjective is something like trying to 
classify all people into two groups, tall and short, or bright and dull. Is such 
classification defensible? Justify your answer. 

5. Even if one uses nothing but the most objective tests there are still some subjective 
elements in the teacher's job of evaluating and marking. Mention some of 
them. Is this a reason to stop trying to be more objective? 


Essay Questions 

Although the more objective types of tests have had a very wide acceptance 
during the last fifty years, the essay question still finds wide use. In a 
recent survey¹ of measurement practices of some 2,303 high school teachers 
in thirty-five states, 13.7 per cent of the teachers questioned said they used 
no essay tests at all. It seems reasonable to conclude, then, that most of the 
other 86.3 per cent do use them, at least occasionally. Also, 81.2 per cent 
reported using short-answer or completion items “very often” or “fairly 
often.” 

Probably the essay-type question is so well known that it doesn’t require 
definition here; nevertheless, a few words of explanation may serve to clarify 
or supplement what the student already knows about this form. The essay 
test usually consists of questions beginning with or including such directions 
as “discuss,” “explain,” “outline,” “evaluate,” “define,” “compare,” 
“contrast,” and “describe.” The pupil is allowed comparative freedom 
with respect to what his answer shall include, its wording, length, and 
organization. Three examples from typical teacher-made tests follow: 


Discuss the events of the period 1850 to 1861 that led to the 
outbreak of the Civil War. 


Explain the essential differences between a standardized test 
and a teacher-made test. 


What are the important steps in processing milk from dairy 
farm to consumer? Describe each one and explain its function. 


1 Victor H. Noll and Walter N. Durost, “Measurement Practices and Preferences of 
High School Teachers,” Test Service Notebook No. 8 (Yonkers-on-Hudson, N.Y.: World 
Book Company). 


Although essay-type questions have continued in favor among teachers 
for a long time, and are stoutly defended by their many advocates, that 
type of question has traditionally been the object of much criticism. A 
considerable amount of experimentation, designed largely to show its weak- 
nesses, has been reported in educational literature. Much of this research 
has been directed towards proving the unreliability of the essay question. 
The pioneer study of this nature, at least in the United States, was reported 
by Starch and Elliott in 1912.² These investigators took a typical 
examination paper written by a pupil in English and had it graded inde- 
pendently by a large number of teachers of English; the same was done 
with a geometry paper and a history paper. In each case the results were 
quite similar: the same paper received marks ranging all the way from 
nearly perfect to very low failure. Similar results were reported during 
the next decade by other investigators. 

This weakness in the scoring of the essay question has been demonstrated 
in still another way. One investigator had teachers score the same set of 
papers twice, with an interval of several months between.³ The findings 
pointed to the conclusion that these teachers did not even agree with their 
own judgments of the quality of the same set of papers when these judgments 
were made at two different times. 

A second principal weakness of the essay examination is that of limited 
sampling. The typical essay test consists of from 5 to 10 or 12 questions. 
An objective test which allows the same amount of time for answering ques- 
tions as the essay test, might well include 100 or more items. Although the 
essay questions are larger units, they do not usually constitute as adequate 
and representative a sample of the field being tested as the 100 objective 
items. Therefore, in the case of the essay test, the teacher must base his 
evaluation of the pupil’s accomplishments on a sampling which is much 
more fragmentary, limited, and sometimes biased. In a 100-item test a 
pupil’s achievement is sampled 100 times and he is called upon to make that 
many separate responses or judgments. In an examination consisting of 
10 essay questions, the person reading the paper has a much smaller sampling 
of the student’s accomplishments upon which to make a judgment; also, 
if the pupil happens to be weak or deficient on one or two essay questions 
he is apt to be penalized far more heavily than he would be for equivalent 
deficiencies on the objective test. 

2 Daniel Starch and Edward C. Elliott, “Reliability of Grading High School Work 
in English,” School Review, 20:442-57 (September, 1912). Subsequent articles dealing 
with mathematics and with history in School Review: 21:254-59 and 21:676-81 (April and 
December, 1913). 

3 Walter C. Eells, “Reliability of Repeated Grading of Essay Type Questions,” 
Journal of Educational Psychology, 31:48-52 (January, 1930). 

Limited sampling affects the reliability of the essay test and also its 
validity, since such a test is likely to give lower correlations with other meas- 
ures of the same abilities than results obtained from measures that sample 
more adequately. 
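
The effect of limited sampling can be illustrated with a little arithmetic. The sketch below is an illustration added here, not a procedure from the studies cited, and it rests on a simplifying assumption: each question is treated as an independent, equally weighted, right-or-wrong sample of what the pupil knows. Essay questions are larger units marked on a scale, so the figures are only suggestive. The function names and the assumed 70 per cent level of mastery are hypothetical.

```python
import math


def share_of_total(num_questions):
    """Percentage of the whole test carried by one equally weighted question."""
    return 100.0 / num_questions


def standard_error(true_proportion, num_questions):
    """Standard error of an observed proportion under independent sampling."""
    return math.sqrt(true_proportion * (1.0 - true_proportion) / num_questions)


# Compare a 10-question test with a 100-item test for a pupil who is
# assumed to have mastered 70 per cent of the field.
for n in (10, 100):
    print(n, share_of_total(n), round(standard_error(0.70, n), 3))
```

Under these assumptions a single missed question costs ten points in a hundred on the short test and one point on the long test, and the sampling error of the short test is larger by a factor equal to the square root of 10, or a little more than three.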

Another disadvantage of the essay question is the time required to read 
the answers. While this kind of examination is rather easily and quickly 
made, judging and scoring the answers is very time-consuming and often 
tiresome. This, of course, makes the essay test expensive, since it must be 
read either by the teacher himself or by equally competent scorers. Usu- 
ally such a test cannot be evaluated by clerical workers or by mechanical 
methods. 

The chief advantages claimed for the essay examination may be stated 
as follows: (a) it is easier to construct than a comparable objective test; 
(b) it can often measure higher mental processes such as ability to think, 
to organize knowledge, and to express ideas clearly, concisely, and in good 
English; (c) the essay test often requires a more useful and rewarding kind 
of study and preparation on the part of the pupil. In preparing for essay 
tests it is said that the pupil does a more thorough and thoughtful job, con- 
centrating on the important larger aspects, relationships, and organization 
of a given subject, whereas when he expects an objective test he usually 
tries to learn specific and possibly isolated bits of information. 

On the first point, there probably can be no serious difference of opinion. 
It is easier and much less time-consuming to prepare ten essay questions 
and write them on the board than to prepare a 100-item objective test. 
In practice, the essay examination is often prepared at the last minute, 
whereas such hurried preparation is not possible in the case of an objective 
test. 

On the second point, we can say only that evidence that the essay examination 
actually measures higher mental processes is hard to find. This is 
not to deny that such advantages exist, but simply to point out that no one 
has apparently been able, or taken the trouble, to find and present evidence 
to support this assumed advantage of the essay examination. Implicit in 
the claim is the idea that the objective test does not, or cannot, measure 
these higher mental processes. Here again, evidence to support this implication 
seems to be lacking. Some work has been done to develop objective 
tests that will measure mental abilities of a higher order, such as ability to 
interpret data, draw conclusions, etc., and this research suggests that it 
may be possible to measure such qualities by means of objective tests. 
Some of this work is described in Chapter 5. Until further research pro- 
duces some evidence on the question, this claimed advantage of the essay 
examination will have to be taken largely on faith. 

The third advantage, namely, that of different preparation for essay tests 
than for more objective ones, seems not to be unequivocally supported by 
evidence. Some studies have been reported showing that when students 
expect an objective test they study for details, while in preparing for an 
essay test they focus attention upon relationships, trends, and organization.⁴ 
However, a more recent study by Vallance⁵ failed to confirm these findings. 


• Learning Exercise • 


6. Can you think of other important advantages and serious shortcomings of 
essay tests? Consult some of the references at the end of this chapter for evidence 
on the question. 


Having reviewed the main lines of discussion and criticism of the essay 
question, we might ask what can be done about it. Is the essay question a 
type that can be recommended for general use? Perhaps a sensible point of 
view for the student of measurement might be expressed as follows: the 
essay question is believed by its advocates to have a number of unique advantages 
over the more objective types; apparently, these advantages have 
been neither proved nor disproved, though some of the basic weaknesses 
have been clearly demonstrated; nevertheless, since the essay type of question 
is used regularly by many teachers, we should try to improve this practice 
in every possible way. 

4 See, for example, Paul W. Terry, “How Students Review for Objective and Essay 
Tests,” Elementary School Journal, 33:592-603 (April, 1933). Also, Harl R. Douglass 
and Margaret Tallmadge, “How University Students Prepare for New Types of Examinations,” 
School and Society, 39:318-20 (March 10, 1934). 

5 Theodore R. Vallance, “Comparison of Essay and Objective Examinations as Learning 
Experiences,” Journal of Educational Research, 41:279-88 (December, 1947). 

Suggestions for improvement of the essay question are usually directed 
at two aspects of the process, namely, preparation of the questions and 
directions, and the scoring or evaluation of the answers. A good deal of 
helpful material has been written on these two points. Although a complete 
review of such material is not appropriate or necessary here, a summary of 
the most useful ideas should be helpful. With respect to planning, constructing, 
and using essay questions, the following recommendations are 
often used as guides: 

(i) Define and restrict the field or area to be covered by the question. For 
example, in high school chemistry one might say, 


Describe the contact process for making sulfuric acid. 


However, it would be better to say, 


Write the equations for the contact process for making sulfuric acid. 
Name all the substances used, and the products. 
Draw a labelled diagram showing where each step of the process occurs. 
What are the by-products of the process and how are they used? 


(ii) The teacher should give more time and thought in advance to the preparation 
of essay questions. It is logical and obvious that if essay questions are 
to measure higher thought processes adequately, some amount of fore- 
thought must be involved in the planning of the questions. Easy as it may 
seem to write a few questions on the board on the spur of the moment, it is 
almost impossible to produce really good questions without fairly extensive 
preparation. Although hastily formulated essay questions may be effective 
once in a great while, no one would seriously argue that this is the best way 
to produce high quality test questions consistently. Indeed, whatever the 
type of test being prepared, the teacher must be willing to devote time, care- 
ful thought, and painstaking effort to its planning. 

(iii) Pupils should be told in advance what type of examination they will be 
given; the value or weight of individual questions should be indicated and in- 
structions should be made clear. Also, if quality of handwriting, spelling, 
grammar, and similar considerations are to be taken into account by the 
scorer, this should be made clear to the pupils before the test is admin- 
istered. 

(iv) Optional questions giving pupils a choice should not be provided, for 
such questions reduce the comparability of the sampling of pupils’ learning 
and therefore of the basis for scoring. When pupils have a choice of questions, 
the lessened comparability of the individual papers affects the accuracy 
of grading and makes it much more difficult to evaluate the papers on 
a common basis. 


Suggestions for improving the scoring of essay questions are rather gen- 
erally agreed upon by those who have studied the matter. Below are some 
of the more frequently mentioned suggestions: 

(i) Determine in advance standards for each question to be used on the exam- 
ination. This is usually done most satisfactorily by writing out a set of 
model answers. The answers written by pupils can then be compared with 
the key. Preparation and use of such a key is valuable in at least two 
ways: it requires the teacher to think through the implications of each question, 
and it provides a standard with which all papers can be compared for 
evaluation. It is desirable to prepare such a key before the questions are 
used, because during this preparation the teacher may discover faults which 
will suggest improvements in or elimination of a question. 

(ii) Remove or cover all pupil-identifying data on the papers to be read. 
It is not always easy for the teacher or scorer to avoid being influenced by 
irrelevant matters when he knows the identity of the pupil whose paper he 
is reading. The evaluation should be based, as far as possible, only on 
what the pupil has written, and other factors should not be permitted to 
influence the scorer’s judgments. Pupils may be instructed to fasten sheets 
together and write identifying data only on the back of one page. 

(iii) Read all papers for one question at a time instead of reading each paper 
in its entirety. In other words, if there are five essay questions answered by 
thirty pupils, the scorer should read the answers to the first question on 
every paper, then all the answers to the second question, and so on. There 
are a number of advantages to this procedure. In the first place, the scorer's 
attention and judgment are focused on one question at a time, and this facil- 
itates a more accurate evaluation of each question than is obtained when the 
answers to a number of different questions are judged in close succession. 
Second, this procedure encourages the rating of answers on a comparative 
basis, each answer being compared with all the other answers to the same 
question. Such a practice generally provides a sounder basis for evaluation 
than a comparison of answers with some arbitrary standard of perfection. 
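
The order of reading recommended in suggestion (iii) can be sketched as a simple nested loop. This is only an illustration: the function name and the paper labels are hypothetical, and the figures of five questions and thirty pupils are taken from the example in the text.

```python
def reading_order(paper_ids, num_questions):
    """Yield (question, paper) pairs so that every answer to one question
    is read before any answer to the next question."""
    for question in range(1, num_questions + 1):
        for paper in paper_ids:
            yield question, paper


# Thirty anonymous papers, five essay questions each.
papers = [f"paper-{i:02d}" for i in range(1, 31)]
order = list(reading_order(papers, 5))

print(len(order))   # 150 answers to be read in all
print(order[0])     # (1, 'paper-01')
print(order[29])    # (1, 'paper-30')
print(order[30])    # (2, 'paper-01')
```

Reversing the two loops would give the paper-by-paper order that the suggestion advises against.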


• Learning Exercises • 


7. Criticize the following as an essay question: 

Discuss the historical background of present-day intelligence 
tests. 

8. Revise this question in the light of your criticisms, retaining its good features 
as an essay question. 

9. Write out a model answer for your revised question. Compare the answer with 
some of the source material given in the references at the end of Chapter 2, pages 32 
and 33. 


Short-Answer Questions 


As the term suggests, the short-answer item consists of a question which 
can be answered with a word or short phrase. It may be in the form of a 
direct question as, 


What is the capital of Switzerland? ------------ 


or it may be in the form of an incomplete statement as, 


The capital of Switzerland is ------------. 


In general, high school teachers seem to favor the short-answer type of 
item, probably because it has some of the claimed advantages both of essay 
and objective test questions. It is relatively easy to construct; it requires 
the pupil to supply the answer; it is not difficult to score or mark (certainly 
much easier than the essay question) ; and time and space requirements will 
usually permit the use of a large number of such questions in a test, thus 
obtaining for the teacher an improved sampling without making the test 
too laborious for the student. 

While short-answer questions have the above-mentioned advantages to a 
greater or lesser degree, they have certain disadvantages as well. One of 
the foremost of these is the concentration upon specific, often unrelated, 
facts. In testing for such bits and pieces of knowledge as the capital of 
Switzerland, the number of ounces in a pound, the opposite of “big,” or 
the discoverer of the Pacific Ocean, the teacher may lose sight of and fail 
to measure the more important objectives. Furthermore, it is often dif- 
ficult to phrase the short-answer question so that only one or even a few 
answers will fit. Unless the teacher succeeds fairly well in this, it will be 
impossible to anticipate the variety of answers which pupils will think of, 
and which will have to be judged as acceptable or not acceptable. 

Since the short-answer type of question has some advantages and is 
favored by many teachers, we shall consider ways of planning and improv- 
ing the construction of such items. Listed below are some suggestions 
which should prove useful. 

(i) Select and state the questions in such a way that they can be answered with 
a word or a short phrase. Avoid such questions as, 


What is the best method of making angel-food cake? 


Although it might be possible to answer such a question by one word (a 
name for a method), there is nothing to prevent the willing pupil from writing 
a paragraph or two describing the best method. It would be better to 
say, 


List the chief ingredients of an angel-food cake. 


(ii) Select and phrase short-answer questions so that only one or a very small 
number of answers will be correct. Do not say, 


Name the major cause of high prices of consumer goods. 


To such a question a dozen or more answers could probably be given, each 
supported with cogent arguments. Better to say, 


[Example item illegible in the source; only the closing word “percent” survives.] 


Such a question calls for a definite, specific, and probably undebatable 
answer. If the teacher wishes to ask a question relating to the cause or 
causes of high prices he should probably not use the short-answer form. 
It would be preferable in such an instance to use the essay form, or perhaps 
the multiple-choice form wherein the pupil is required to choose the best of a 
limited number of alternatives. 

Short-answer items are, on the whole, easy to construct, though one should 
not expect to make good short-answer items by taking statements verbatim 
from the text and merely omitting a word or two. The item should con- 
sist of a rephrasing of the idea or point being tested, or an entirely original 
statement of it. Short-answer questions are useful primarily in testing 
knowledge of facts and quite specific information. However, because of 
the shortcomings mentioned earlier, this type of item is not frequently 
used in standardized tests. 


• Learning Exercise • 


10. Construct five short-answer questions dealing with some objective of a subject 
you are likely to teach. Try these out, if possible, on one of your fellow students. 
Do the results bring out any weaknesses and suggestions for improvement? 


Completion Questions 


What has been said about short-answer items applies also to completion 
items, which are very similar. However, the short-answer item nearly al- 
ways has the blank at the end, whereas in the true completion item the 
blank or blanks may occur anywhere in the statement. It is particularly 
important, in the case of completion items, for the teacher to phrase his 
questions in such a way that only a single answer will be correct. Indef- 
initeness of the question or variety of possible answers will quickly multiply 
the troubles encountered in the scoring of completion items. Listed below 
are a few additional suggestions for constructing good completion items: 

(i) Omit only significant words from the statement. Do not omit articles, 
prepositions, conjunctions, or similar words unless the purpose is to test the 
usage of such words (as in the case of a grammar test). If the statement to 
be used were, 


Democracy is that form of government in which the whole peo- 
ple or some numerous portion of them exercise the governing 
power through deputies periodically elected by themselves. 


one might devise two questions as follows: 


Democracy is that ------------ of government in which the 
whole people or ------------ numerous portion of ------------ 
exercise the governing power ------------ deputies periodically 
elected ------------ ------------. 

or, 

Democracy is that form of ------------ in which the whole 
------------ or some numerous portion of them exercise the 
governing ------------ through ------------ ------------ 
elected by ------------. 


In the second case, the essential and important ideas of government, by the 
people, power, representation, periodical elections are tested; in the first, in- 
consequential words like form, some, them, through, by themselves are required, 
and it is quite apparent that such words have no significant relationship to 
the important ideas the statement conveys. 

(ii) In omitting significant words, leave enough clues to enable the competent 
person to answer correctly. If this principle is not adhered to, filling in the 
blanks becomes either impossible or just a guessing game. To illustrate, 
do not say, 


[Example item not legible in the source.] 


It is impossible to answer this item with assurance of being correct because 
dozens of ideas could be expressed and defended. However, if the statement 
were to read, 


Columbus discovered ------------. 


it would be far more definite. Most children who have learned a few basic 
facts of American history could answer correctly. The effect of excessive 
deletions is even more marked as statements become more complex. 

(iii) In scoring short-answer and completion items it is generally most satisfactory 
to allow one point credit for each blank, unless the item requires several 
words or a phrase. For example, 


In woodworking, the device used most commonly for cutting 
molding and small lumber at 45° angles is called a mitre 
box 


Here, although two words are needed to complete the statement, only one 
point should be allowed in scoring, and no partial credit should be allowed 
for either word alone. Neither is correct by itself, and both are necessary 
to convey the essential idea. 

In the following example, however, four points would be given, one for 
each correct response: 


The four seasons are ------------, ------------, ------------, 
and ------------. 


This system simplifies scoring and avoids many difficulties arising from 
methods giving partial credits. 
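
The scoring rule just described, one point for each blank and no partial credit within a blank, might be sketched as follows. The function names are hypothetical, and two details are assumptions made for the sketch and not part of the rule as stated: differences of capitalization and spacing are ignored, and in a list-type item such as the four seasons the answers are accepted in any order but each is credited only once.

```python
def normalize(text):
    """Ignore capitalization and extra spacing when comparing answers."""
    return " ".join(text.lower().split())


def score_in_order(responses, key):
    """One point for each blank whose response matches the key for that blank."""
    return sum(
        1 for given, correct in zip(responses, key)
        if normalize(given) == normalize(correct)
    )


def score_any_order(responses, accepted):
    """One point for each distinct accepted answer supplied, in any order."""
    remaining = {normalize(answer) for answer in accepted}
    points = 0
    for given in responses:
        candidate = normalize(given)
        if candidate in remaining:
            remaining.remove(candidate)  # credit each answer only once
            points += 1
    return points


# A two-word answer fills a single blank and earns one point or none.
print(score_in_order(["Mitre box"], ["mitre box"]))  # 1
print(score_in_order(["mitre"], ["mitre box"]))      # 0

# Four blanks, one point each; a wrong entry simply earns nothing.
seasons = ["spring", "summer", "autumn", "winter"]
print(score_any_order(["Summer", "winter", "autumn", "monsoon"], seasons))  # 3
```

Whether a synonym or a misspelling is to be accepted remains a matter of judgment that the teacher must settle before scoring begins.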


• Learning Exercise • 


11. Completion questions have been more widely used in mental ability or gen- 
eral intelligence tests than in any other kind of tests. Can you explain why? 


True-False Questions 


In the early days of objective-test development the true-false item was 
very popular. Some of the first published tests consisted entirely of this 
type. In recent years its popularity has declined to such an extent that one 
finds it used only rarely in standardized tests. There are at least two reasons 
for this trend: the inherent weakness in the item itself, especially with regard 
to the large element of chance or guessing that may enter into the answering, 
and the fact that it appears deceptively easy to make good true-false 
items. 

The true-false item usually consists of a declarative sentence to which the 
examinee responds by marking it true or false, thus: 


( ) The capital of Michigan is Lansing. 


Or, 


( ) The first Continental Congress met in New York in 1798. 


A variety of modifications of the true-false item have been tried, nearly 
always in an attempt to lessen the effect of its chief weakness, the element 
of chance success or guessing. One possible modification is to have the 
pupil correct the false item by crossing out the word or part that makes it 
false, and writing in the word or phrase that will make the statement correct. 
For example, the item appears on the mimeographed or printed test as 
follows: 


( ) A pair of scissors is an example of a lever of the second 
class. 


The pupil marks it false and corrects it, thus: 


(F) A pair of scissors is an example of a lever of the [second, crossed out, with “first” written above it] 
class. 


The statement is false in that scissors are an example of first class levers. 
The item is correctly marked, as shown above, by crossing out the word 
second, and writing above it the word first. In this variation, true state- 
ments are marked by simply placing a T within the parentheses; nothing 
further needs to be done with them. 

Every true-false question calls for a choice between two alternatives. A 
statement is either right or wrong, true or false. Hence, the possibility of 
guessing the right choice, mathematically speaking, is 1 in 2, or 50 per cent. 
Therefore, with 100 such items it should be possible to get half of them right 
on the basis of pure chance alone. The pupil whose results on the test show 
that he has answered half of the 100 items right, and half wrong, can thus 
be assumed to have done no better than a pupil who has guessed his way 
through the entire test. To correct for the element of guesswork, it is 
common practice to score such tests by the formula $S = R - W$, where 
$S$ is the score, $R$ the number of true-false items answered correctly, and 
$W$ the number of true-false items missed.⁶ 

In the case cited, this would work out as follows: $S = 50 - 50 = 0$. If 
Robert, on the other hand, gets 70 right and 30 wrong, he would have 
$S = 70 - 30$, or 40. Items not attempted are usually disregarded in the 
scoring. For example, if James omits 12 items out of the 100, and gets 60 
right and 28 wrong of the remaining 88, he would receive a score of $60 - 28$, 
or 32, the 12 omitted being disregarded in arriving at the score. 

In the case of Robert, the assumption is that if he knows the answers to 
40 items and guesses on the remaining 60, he would get half of these, or 30, 
right purely by chance. Consequently, his correct score, 40, is obtained by 
use of the formula $S = R - W$, or $S = 70 - 30 = 40$. 


James, however, is quite uncertain about 12 items and omits them; he 
knows the answers to 32 and guesses on the remainder of 56, getting half 
of them (28) right and the same number wrong. In this case, the omitted 
items must first be allowed for: $100 - 12 = 88$, the number attempted. 
Of these, he knows the answers to 32 and guesses on the remaining 56, 
getting half, or 28, right and the same number wrong. His score, then, is 
$S = 60 - 28 = 32$, the number he actually knows, according to the original 
assumption. 

6 The general formula for correction for chance success in objective tests is 
$S = R - \frac{W}{n - 1}$. Here, $n$ equals the number of choices. In a true-false item $n = 2$, so the 
formula becomes $S = R - W$. 

Many problems that are beyond the scope of this book enter into the 
scoring of such items, yet the basic theory underlying the use of the $S = R - W$ 
formula is essentially as presented here. Much research has been 
done on the problem, with somewhat varying results. The net effect has 
been a gradually diminished use of the true-false question despite modifica- 
tions in form or scoring. The true-false item is generally regarded as being 
less reliable than other types of objective items, largely because of this element 
of chance or guesswork. 
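
The correction for chance success can be written as a short function. The sketch follows the general formula given in the footnote, score equals rights minus wrongs divided by one less than the number of choices, with omitted items left out of the count. The function and parameter names are hypothetical, and exact fractions are used only to avoid rounding when the number of choices exceeds two.

```python
from fractions import Fraction


def corrected_score(right, wrong, choices=2):
    """Score corrected for guessing: S = R - W / (n - 1).

    Omitted items are counted neither as right nor as wrong.
    """
    if choices < 2:
        raise ValueError("an item needs at least two choices")
    return Fraction(right) - Fraction(wrong, choices - 1)


print(corrected_score(50, 50))  # 0   the pupil who guessed throughout
print(corrected_score(70, 30))  # 40  Robert
print(corrected_score(60, 28))  # 32  James, 12 items omitted
```

For a five-choice item the same function deducts only one fourth of a point for each wrong answer, so that 60 right and 20 wrong would give a corrected score of 55.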

Another factor that has reduced the prestige and use of this type of item 
is the tendency of inexperienced and relatively untrained persons to fall 
into the error of assuming that good true-false questions are easy to con- 
struct. The inexperienced teacher will often lift sentences verbatim from 
the textbook, insert a negative, or introduce some other slight modification, 
and expect therewith to produce a good true-false item. It should be emphasized 
that superior items are almost never formulated in this manner. 
The true-false question requires very careful planning and construction; it is 
hoped that the suggestions below will help the student materially in avoid- 
ing many of the common weaknesses of this type of question. 

(i) Do not include more than a single idea in one true-false item, particularly 
if one idea is true and the other is false. Except in special situations where the 
testing may be directed toward unusual objectives, it is considered better 
to make each true-false item deal with one idea, and to use a statement 
which is either wholly true or wholly false. Otherwise, the item is sure to 
be ambiguous. For example, 


( ) Lincoln spent six weeks planning his Gettysburg Address 
before delivering it on November 19, 1863. 


This statement is ambiguous because the facts are all correct except the 
amount of time spent in preparation of the speech. It probably would be 
better to make two items, thus: 


( ) Lincoln spent six weeks planning his Gettysburg Address. 


( ) Lincoln delivered his Gettysburg Address on November 19, 
1863. 


This makes both items acceptable since one is clearly true, the other clearly 
false, and neither sentence is ambiguous. 

(ii) Avoid negative statements wherever possible. If they are used, the word 
or phrase that makes the statement a negative one should be emphasized by 
italics or underlining. A false negative statement results in a double negative 
(or positive) statement, and such statements will nearly always 
confuse pupils. Since our primary purpose is to test learning, not to confuse 
or test for ability to solve puzzles, there seems little justification for the use 
of statements which are ambiguous. The statement, 


( ) A true-false question should not be negatively stated. 


is true and the word which makes it a negative is italicized. Consider, 
however, the following item, a perfectly straightforward positive statement: 


( ) Washington is the capital of the United States. 


Nearly everyone would mark this correctly as a true statement. In the 
form given, it poses no special problems. If, on the other hand, the state- 
ment is changed to read, 


( ) Washington is not the capital of the United States. 


confusion may result because of the double negative thus introduced. The 
pupils’ hesitancy in reacting to this kind of item is easily demonstrated. 
First, write the positive statement on the board and ask the class to say 
orally whether it is true or false. There will be a chorus of “True!” Erase 
the positive statement and write the negative one in its place. Now ask 
for a response, true or false. Usually some will say “False,” some “True,” 
and the majority will say nothing. 

(iii) An approximately equal number of true and false items should be used. 
Any great preponderance of either true or false statements might enable 
the student to detect a trend as he progresses with the test. 
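
As a rough aid in applying suggestion (iii), the balance of an answer key can be checked mechanically. The sketch below is only an illustration: the function name `key_balance` and the sample key are hypothetical, and the check assumes a non-empty key recorded as a sequence of "T" and "F" marks. It reports the share of true items and the longest unbroken run of identical answers, either of which might reveal a trend to the pupil.

```python
from collections import Counter

def key_balance(key):
    """Return (share of 'T' answers, longest run of identical answers)."""
    counts = Counter(key)
    share_true = counts["T"] / len(key)
    longest = run = 1
    for prev, cur in zip(key, key[1:]):
        # Extend the current run if the answer repeats, otherwise start over.
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return share_true, longest

# Hypothetical answer key for a ten-item true-false test.
key = list("TFTTFFTFTF")
share_true, longest_run = key_balance(key)
print(f"true items: {share_true:.0%}, longest run: {longest_run}")
```

A share far from one half, or a long run of the same answer, suggests that the key should be rearranged before the test is reproduced.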

(iv) Avoid long, involved statements, especially those containing dependent 
clauses, many qualifications, and complex ideas. These factors tend to make 
the items test reading comprehension rather than achievement in a subject. 
It is better technique to break up such long statements into two or more 
separate items which will be more easily understood and which will yield 
more exact and specific information concerning the student's attainments. 


• Learning Exercises • 


12. It has been suggested by some authorities that true-false questions are better 
stated in the form of a question, as, “Is a cow a biped?” rather than in the positive 
form, as, “A cow is a biped.” Can you see any justification for this point of view? 
Do you think it has merit? 

13. Construct ten true-false items in a field or area of your own choosing. Check 
these against the suggestions given in the preceding section. How good a job have 
you done? What is the most satisfactory way to evaluate test questions? Why? 


Matching Questions 

This type of item is employed widely in situations where relationships of 
more or less similar ideas, facts, or principles are to be examined or judged. 
For example: 


(3) adjective     1. and 
(2) noun          2. boy 
(5) verb          3. slow 
                  4. to 
                  5. write 


Here, an example of each of the parts of speech in the first column is to be 
found in the second column. The pupil indicates the correct choice in each 
case by writing the number of the correct example in the parentheses pre- 
ceding the name of the part of speech the example represents. In practice, 
the lists are usually longer than the brief illustration given above. 

A modification of the matching question is the classification item which 
may also be used to advantage in some cases. For example, in the illustra- 
tive matching set below, a list of terms is given together with a key to be 
used in identifying or classifying the terms: 


1. bird 2. mammal 3. fish 4. amphibian 
( ) eagle ( ) cow 
( ) mouse ( ) whale 
( ) alligator ( ) pike 
( ) bat etc. 


The matching type of item lends itself well to testing knowledge of words, 
dates, events, persons, formulas, tools, and many other such matters in- 
volving simple relationships or categories. Matching questions are not well 
suited to the testing of broader concepts such as the ability to organize 
and apply knowledge, however. The suggestions which follow have been 
found helpful in improving the quality and usefulness of matching 
items. 

(i) A matching exercise usually should not contain more than ten or twelve 
items. That is, the number of terms, names, etc., to be identified or matched 
should not exceed ten or twelve, because longer lists will become quite 
burdensome and tiring to the person taking the test. Where there are more 
than a dozen items to be tested, it is usually better to construct two or 
more separate matching exercises. 

(ii) The number of items in the column from which matching terms or formu- 
las are to be selected should always exceed the number of items in the opposite 
primary column or list. In other words, there should be a number of op- 
tional choices or alternatives from which to choose items for matching. 


For example, instead of this, 


(c) fast a. forte 

(a) loud b. pianissimo 

(b) very softly c. presto 

use this: 

(e) fast a. forte 

(a) loud b. legato 

(c) very softly c. pianissimo 
d. poco 
e. presto 


In this way, the opportunity for chance success is probably decreased because 
there is less chance of arriving at choices by the process of elimination. 
In the first instance, if the pupil knew two of the three terms he would 
get the third one automatically; in the second instance, he would still have 
to choose from among three remaining terms. If there are ten terms in the 
first column, there should be at least twelve possible choices in the other 
column, and as many as fifteen would be acceptable. 
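
The effect of extra options on elimination can be put in simple arithmetic. The following sketch is illustrative only. It assumes, as in the musical-terms example, that each option may be used at most once and that the pupil has already matched some terms correctly; the function name `chance_on_next_match` is hypothetical.

```python
from fractions import Fraction

def chance_on_next_match(n_options, n_known):
    """Chance of guessing one further term by elimination.

    Assumes each option is used at most once and that n_known terms
    have already been matched correctly, removing n_known options.
    """
    remaining_options = n_options - n_known
    return Fraction(1, remaining_options)

# Three terms, three options: the third match is automatic.
print(chance_on_next_match(n_options=3, n_known=2))   # 1
# Three terms, five options: three options still remain.
print(chance_on_next_match(n_options=5, n_known=2))   # 1/3
```

With ten terms and twelve options, a pupil who knows nine terms still faces three remaining options for the tenth.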

(iii) There should be a high degree of homogeneity in every set of matching 
items. All of the items or terms in each exercise should belong to the same 
category; otherwise, those which do not will be much easier to match 
than the rest. To illustrate: 


(f) capitals        a. at end of direct query 
(g) comma           b. used to show possession 
(c) period          c. at end of declarative sentence 
(a) question mark   d. after expression of strong feeling 
(e) semicolon       e. used to show balance between 
                       coordinate sentence elements 
                    f. begin all proper nouns 
                    g. sets off non-restrictive clauses 


In the above example the word "capitals" is in a different category than 
the others in that it is not a punctuation mark. By looking at the list of 
options, the intelligent pupil can easily determine which one fits. He will 
quickly reason that f is the only choice which could possibly fit “capitals,” 
since all other options deal with matters of punctuation, and he will observe 
that “capitals” is the only plural form in the first column and that “begin” 
is the only verb in the second column which requires a plural subject. 
Both errors are easy to commit unconsciously in making sets of matching 
items. 

(iv) The terms in both lists or columns of a matching exercise should be 
arranged alphabetically, wherever possible. Such an arrangement facilitates 
finding items in the lists and reduces the clerical task. When items 
consist of phrases, clauses or sentences, or material of a non-verbal charac- 
ter such as chemical formulas, numbers, algebraic terms, and the like, alpha- 
betizing is out of the question. However, whenever some systematic 
arrangement that does not furnish clues to correct choices is possible, it 
should be followed if the system makes the clerical task easier. Of course, in 
a test where clerical ability is being measured, the examiner might purposely 
complicate the arrangement of items. 

(v) Use the classification type of matching question as a variation of the 
standard form, particularly in testing such objectives as ability to apply or 
interpret. For example: 


a. Adjective d. Interjection g. Pronoun 
b. Adverb e. Noun h. Verb 
c. Conjunction f. Preposition 

(f) 1. We went over the river on the bridge. 

(b) 2. Turn your paper over. 

(h) 3. We have been over that once before. 

(e) 4. John scored a run. 

(h) 5. Mary said she would run fast. 

(e) 6. Running is hard work. 


Here, a variety of situations are provided in which the pupil may apply 
what he has learned about parts of speech. 

(vi) Pictorial items can often be used advantageously as a variation in the 
matching type of question. 


[Illustration omitted: pictured items, each to be matched with one of the terms below.] 

a. carbohydrate 
b. fat 
c. minerals 
d. protein 
e. vitamins 


Such items may be used in subjects where apparatus, equipment, or tools 
are used. Pictures of objects may be matched with their names, uses, or 
other characteristics. 

Although matching questions are used quite frequently in informal and 
locally made tests, their use in published, standardized tests seems to be 
declining. This type of item does not seem to be popular with professional 
test-makers for reasons which are not altogether clear. One reason probably 
is the limited usefulness of the matching question; another reason may be 
the lack of research data on its value as a measurement device. Neverthe- 
less, for informal tests made locally, matching questions furnish a useful 
variation and will probably continue to be used by many teachers and other 
persons who are interested in broadening and improving their own meas- 
uring techniques. 


• Learning Exercises • 


14. What kinds of learning outcomes might be efficiently measured by matching 
questions in: 
(a) seventh-grade English 
(b) third-grade reading 
(c) ninth-grade civics 
(d) trigonometry 
(e) biological science 


15. Prepare a set of matching questions to measure one of the outcomes you 
mentioned in the preceding exercise. Try this out on someone at or near the 
appropriate grade level in this subject. (In some cases this might be tried on a 
college or university student who has not studied the subject since he left grade 
school or high school.) 


Multiple-Choice Questions 
The percentages of high school teachers reporting the use of multiple-choice 
items “fairly often” or “very often,” in the survey previously cited,⁷ 
are 50.6 and 16.4, respectively, or a total of 67 per cent. The multiple-choice 
item apparently is the most popular of the objective types. (Corresponding 
percentages for the matching type of item are 45.1 and 16.0, a 
total of 61.1; for true-false the percentages are 37.4 and 11.3, or 48.7 per 
cent.) 
The multiple-choice item usually consists of an incomplete declarative 
sentence followed by a number of possible responses, one of which is clearly 
correct or best. For example: 
A plane figure of four sides and four angles is called (1) a 
tetrahedron (2) a pyramid (3) a quadrilateral (4) a cube 
(5) an octagon 


or again, it may be in the form of a question, thus: 


Which of the following was the leading character in a famous 
play by the same name? (1) Cyrano de Bergerac (2) Silas 
Marner (3) Nathaniel Hawthorne (4) Betsy Ross (5) Shylock 


7 Noll and Durost, op. cit. 

The multiple-choice item is probably the most versatile of the objective 
recognition types. It lends itself to a wide variety of situations, objectives, 
and content. The item can be quite objective in its scoring, it provides 
opportunity for wide coverage in the choice of alternatives, and it is not 
conducive to chance success or guessing. It is so generally regarded as the 
best and most widely applicable type of item that it has become the stock- 
in-trade, the basic type, for most standardized tests today. 

A number of suggestions and some examples of good techniques for mul- 
tiple-choice questions are given below. 

(i) Probably the most important skill in making multiple-choice items is in 
the framing of alternatives. One choice must be clearly the best, but the others 
must appear plausible to the uninformed, perhaps even more so than the correct 
choice. To illustrate: 


The capital of the United States is (1) Washington (2) New 
York (3) Chicago (4) St. Louis (5) Los Angeles 


Each of these choices would probably function, that is, be chosen as the 
answer by pupils at different levels, living in different parts of the country. 
As it stands, this might be a good question for Grade Four. But if we 
change the alternatives we might have this: 


The capital of the United States is (1) Washington (2) At- 
lantic City (3) Reno (4) Milwaukee (5) San Antonio 


The item is now easier because the four wrong choices are somewhat more 
obvious than those in the first example. Going still further in this direction, 
the item might become: 


The capital of the United States is (1) Washington (2) Rome 
(3) Tokyo (4) Paris (5) London 


This item is probably even easier than the others because most children in 
school who have studied any social science, whether they know that Wash- 
ington is the answer or not, could probably arrive at it by the process of 
elimination. The item can also be made absurdly easy: 


The capital of the United States is (1) Washington (2) wheat 
(3) China (4) air (5) birds 


The above illustrations are intended to show how the selection of alter- 
natives will affect the difficulty of a multiple-choice item and the function- 
ing of the choices themselves. An alternative which does not seem plausible 
or which appeals to no one serves no purpose in the item. If there is one 
such alternative in a five-response multiple-choice item, the item becomes, 
in effect, a four-response item. If there are two such alternatives, it becomes 
a three-response item, and so on. The maker of multiple-choice items 
will not be successful, regardless of what other capabilities he may have, 
unless he develops skill in formulating choices which are functional — 
choices which, even though incorrect, are plausible enough to be chosen. 

(ii) A multiple-choice item should not have more than one acceptable answer. 
Some teachers have a liking for items where more than one of the choices is 
acceptable. The difficulty with such items lies in the scoring, for which no 
generally acceptable method has been devised. For example: 


She has never liked (1) this (2) them (3) that (4) those 
(5) these ... kind of stories. 


In this case both 1 and 3 are acceptable choices, but how shall the item be 
scored in the following instances? (It is assumed that pupils have been 
instructed that one or more choices are acceptable.) 

A chooses 1 and 2 — one right, one wrong 

B chooses 1 and 3 — two right 

C chooses 4 and 5 — two wrong 

D chooses 1, 2, 3, and 4 — two right, two wrong 


PUPILS    RIGHTS    RIGHTS - WRONGS 
A         1          0 
B         2          2 
C         0         -2 
D         2          0 


If scoring is on rights alone, B and D get the same score, although B makes 
no errors while D makes two; also, D gets a higher score than A, who makes 
fewer errors. 

If scoring is Rights — Wrongs, C gets a negative score, though this is un- 
desirable; if no score lower than zero is given, C scores the same as A and 
D, who get one and two right, respectively. 
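
The two scoring schemes discussed above can be worked through mechanically. The sketch below simply reproduces the figures for pupils A through D; the names `CORRECT`, `responses`, and `score` are illustrative and not part of any standard scoring procedure.

```python
# Choices 1 ("this") and 3 ("that") are the acceptable answers to the item.
CORRECT = {1, 3}

# Choices marked by each pupil, as listed in the text.
responses = {
    "A": {1, 2},
    "B": {1, 3},
    "C": {4, 5},
    "D": {1, 2, 3, 4},
}

def score(chosen, correct=CORRECT):
    """Return (rights, rights minus wrongs) for one pupil's set of choices."""
    rights = len(chosen & correct)
    wrongs = len(chosen - correct)
    return rights, rights - wrongs

print("Pupil  Rights  R-W  R-W (no score below zero)")
for pupil, chosen in responses.items():
    rights, net = score(chosen)
    print(f"{pupil:<6} {rights:<7} {net:<4} {max(net, 0)}")
```

Under either scheme at least two pupils with plainly different performances receive the same score, which is the difficulty the text describes.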

It is generally far more satisfactory to make two items, in this case 
one with “this” as the acceptable choice and the other with “that.” Care 
must be taken, however, to vary the foils so that the correct answers will 
not be immediately obvious, as they are here: 


(a) She has never liked (1) this (2) them (3) those (4) these 
kind of stories. 


(b) She has never liked (1) that (2) them (3) those (4) these 
kind of stories. 


Here, the fact that three of the choices are the same in both items would 
lead the pupil to deduce that the correct choice in a must be “this” and in 
b, “that.” Such clues which lead to a correct answer, but which are irrelevant 
to the real purpose of the item, should be avoided. 

(iii) The choices in an item should come at or near the end of the statement. 
When the item is stated as a question, the choices usually occur near the 
end naturally. For example, if the question is, 


[Example item illegible in the source.] 


there is no natural way of presenting the alternatives except as indicated. 
However, one could state the item thus: 


[Example item, with the alternatives placed earlier in the statement, illegible in the source.] 


It is usually considered preferable to place the alternatives at the end of 
the item, because it seems a more natural sequence to have the question or 
problem followed by the suggested answers, and that arrangement is likely 
to cause little confusion for the person taking the test. 

(iv) In multiple-choice items the best or correct answer should be placed 
equally often in each possible position. That is, if there are five choices in 
each item and a considerable number of items in the test, the best answer 
should occur with approximately the same frequency as 1, 2, 3, 4, and 5. If 
the best answer appears much more often in one position than in others, its 
position might serve as an irrelevant clue. It is also essential to have the 
best-answer position randomized; that is, the first 20 per cent of the items 
should not all be 1's, the next 20 per cent should not all be 2's, etc. If, as 
multiple-choice items are constructed, the best answer of the first one is 
placed in the No. 1 position, the best answer of the next in the No. 2 posi- 
tion, and so on, an equal distribution will be obtained. When the items are 
finally arranged, usually according to difficulty, the order of correct choices 
will automatically be randomized. 
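
The rotation procedure just described can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed method: the difficulty values are invented by a random number generator purely to stand in for the teacher's estimates, and the names `assign_positions`, `difficulty`, and `ordered` are hypothetical.

```python
import random
from collections import Counter

def assign_positions(n_items, n_choices=5):
    """Rotate the best answer through positions 1..n_choices as items are written."""
    return [(i % n_choices) + 1 for i in range(n_items)]

positions = assign_positions(20)

# Stand-in difficulty estimates (hypothetical); a teacher would supply judgments.
rng = random.Random(0)
difficulty = [rng.random() for _ in positions]

# Arrange the items from easiest to hardest; the key positions travel with them.
ordered = [pos for _, pos in sorted(zip(difficulty, positions))]

print("frequency of each position:", sorted(Counter(ordered).items()))
print("order of best-answer positions:", ordered)
```

Rearranging by difficulty leaves the frequency of each position unchanged while breaking up the original 1, 2, 3, 4, 5 cycle. Whether the resulting order is sufficiently irregular still deserves a glance before the test is reproduced.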

(v) Choices should be in parallel form wherever possible. For example: 


The first activity on a cold wintry morning was 
(1) to gather wood and build a fire 
(2) eating a hearty breakfast 
(3) fishing for mackerel 
(4) go for fresh water 
(5) a dash around the yard 


This item is improved if the wording is changed to 


The first activity on a cold wintry morning was 
(1) gathering wood and building a fire 
(2) eating a hearty breakfast 
(3) fishing for mackerel 
(4) going for fresh water 
(5) dashing around the yard 


(vi) Choices should fit the item grammatically. For example: 


If a straight line forms a right angle with another straight line 
it is said to be 
(1) parallel 
(2) straight 
(3) perpendicular 
(4) equal 
(5) adjacent ... to that line. 


Obviously, choice 2 does not fit the concluding phrase, grammatically 
speaking, and thus could not possibly be the best answer. 

(vii) The length of the item and the length of choices should be determined 
by the purpose of the item. For example, in a test for knowledge and understanding 
of words, an item might read: 


The word that means the same as “vanished” is 


(1) broadened (4) changed 
(2) disappeared (5) narrowed 
(3) decreased 


On the other hand, understanding of the theme or main idea of a story 
might be tested as follows: 


The lesson the speaker learned from his friends was that 

(1) old age can be the happiest, most useful time of life 

(2) it is not desirable for old people to wish to become young 
again or to behave in the carefree manner of youth 

(3) seeking the fountain of youth is well worth the effort 

(4) it is impossible to make a magic potion that will make 
people permanently young 

(5) you are as young as you feel and act 


(viii) The number of choices in multiple-choice items should be at least four; 
the generally-preferred number is five. It can be assumed that as the number 
of choices is reduced the chance factor increases. Theoretically, with four 
choices there is one chance in four of guessing the answer; with three choices 
the chances become one in three; and with two alternatives the chances are 
mathematically the same as in a true-false item. Common practice in 
standardized tests is to use the five-response multiple-choice item. It is 
also customary to provide each item with the same number of choices. 

• Learning Exercise • 


16. Construct a multiple-choice item to measure each of the following: 
(a) Knowledge of the capital city of your state. 
(b) Understanding of a rule of punctuation, e.g., the comma. 
(c) Ability to interpret a graph or a table showing the relationship of height 
of children to their weight. 
(d) Skill in defining a problem in general science, e.g., how should a high school 
boy or girl who is overweight proceed to reduce? 


Objective Test Questions: General Suggestions 


In addition to the suggestions given previously for particular types of 
test items, there are a few precepts which are applicable to all objective 
items and, in varying degree, to completion, short-answer, and essay questions. 
These precepts are discussed and illustrated briefly below. 


Avoid ambiguity, the most common trouble of test-makers. 
An item is ambiguous if its meaning is not clear or if the statement of the item 
can be interpreted in more than one way. There are many ways in which 
ambiguity will develop, as has been shown in the examples of the various 
specific types of test items. Ambiguity confuses the person being tested, it 
makes objective scoring very difficult — if not impossible — and it reduces 
the reliability of the test. It is difficult to foresee or anticipate this fault 
in a test item, though if ambiguity is detected in advance, it can often be 
corrected easily. Several suggestions are listed below, which, if practiced, 
will help the student avoid ambiguity in most kinds of test items. 


1. Strive for clear-cut, concise, exact statements. Long, complicated 
statements lead to ambiguity and are difficult to understand. 


2. Avoid negative statements, particularly in true-false items where a 
false negative statement results in a double negative. 


3. Have another person look over your test items to criticize them and 
suggest improvements. Another’s viewpoint is often helpful in detecting 
faults that the maker of the test has overlooked. 
4. Analyze the results of individual items after a test has been given 
once or twice. The surest way to detect ambiguity is by item-analysis. 
This procedure will be discussed in the next chapter. 


5. One of the most common forms of ambiguity is found in items that 
are partly true and partly false, or which may have more than one correct 
answer. For example, consider the item: 


Coffee is imported into the United States unroasted, blended, 
and ground. 


This is true of “unroasted,” but not of “blended and ground.” How 
should the item be marked — true or false? If there is some special reason 
for using items of this sort, the instructions should clearly state that unless 
an item is wholly true, it is to be marked false. 


Test items should not be unnecessarily complex or difficult to 
interpret. The teacher can formulate questions which are sufficiently 
difficult for the level of ability being tested without resorting to awkward 
sentence structure and needlessly difficult vocabulary. 


Statements to be used in test items ordinarily should not be taken 
verbatim from the textbook or other instructional material. Some- 
times there may be a need to test for knowledge and understanding of the 
exact wording of a text, but this will probably not occur often. If one 
wishes to measure the extent to which pupils know, understand, and use 
what has been taught, it is desirable to test by rephrasing, reorganizing 
and restating the content so that the exact wording of the original is not 
reproduced. Instead of using the key sentence of a paragraph as a test 
item, the teacher who is testing for understanding of any central thought 
or principle should use a different form or different words than used in the 
textbook. In so doing, the teacher will discourage memorization of the 
words of a textbook and will be able to determine more adequately whether 
or not a pupil really understands and can apply what he has read. 


Restrict the types of items used in any given test. Although it 
was stated earlier in this chapter that the type of item used is determined 
in part by the purpose of the test or the objective being measured, it is 
generally desirable to restrict to two or three the kinds of items used in the 
same test. The purposes of the test can usually be achieved with a few 
item types, and the use of too many different kinds may confuse and disturb 
some pupils. 


Avoid giving irrelevant clues. Unless it is carefully constructed, a 
test item may, in itself, furnish an indication of the correct or expected 
answer. For example, the following item is subject to criticism because of 
the similarity of the word “parallel” in the item and “parallelogram,” 
the answer: 


A four-sided figure whose opposite sides are parallel is called 

(1) a trapezoid (4) an octagon 

(2) a triangle (5) a hexagon 

(3) a parallelogram 
The teacher in his preparation of test items must always be on guard 
against inadvertently providing such clues. Listed below are several other 
ways in which irrelevant clues are commonly — though unintentionally — 
provided in the test item. 


1. There is a tendency for longer statements or choices to be true or 
correct more often than not. The test-maker should be on guard against 
this tendency. 


2. True-false items containing universals like “always,” “never,” “all,” 
and “none,” are likely to be false, while those containing “generally,” 
“usually,” “some,” “many,” and “sometimes,” are usually true. If such 
terms are used, the correct response should not be in line with the specific 
determiner; that is, more often than not the statements containing “always,” 
“never,” etc., should be true and those containing “some,” “usually,” 
etc., should be false. 


3. The correct (or incorrect) answer to one item should not be given by 
another item. To illustrate from a teacher-made, true-false test in United 
States history, two items were stated as follows: 


( ) The 18th Amendment gave women the right to vote. 
( ) Repeal of the 18th Amendment occurred during the administration 
of Franklin Roosevelt. 


Here, the correct answer to the first item may be deduced from the second 
item, for if the student knows that the second item is true, he also knows 
that the first must be false. 
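
The check for specific determiners described in the second suggestion above lends itself to a simple clerical aid. The sketch below is an assumption-laden illustration: the word lists are taken from the text, but the function names and the three sample statements are hypothetical, and a flagged item is merely one worth a second look, not necessarily a faulty one.

```python
import re

# Specific determiners named in the text.
UNIVERSALS = {"always", "never", "all", "none"}
QUALIFIERS = {"generally", "usually", "some", "many", "sometimes"}

def determiners(statement):
    """Return the universals and qualifiers that occur in a statement."""
    words = set(re.findall(r"[a-z]+", statement.lower()))
    return sorted(words & UNIVERSALS), sorted(words & QUALIFIERS)

def clue_agrees_with_key(statement, keyed_true):
    """True when the determiner points to the keyed answer (universal and false,
    or qualifier and true), so that it may serve as an irrelevant clue."""
    universals, qualifiers = determiners(statement)
    return (bool(universals) and not keyed_true) or (bool(qualifiers) and keyed_true)

# Hypothetical true-false items with their keyed answers.
items = [
    ("All mammals live on land.", False),
    ("Some metals are liquid at room temperature.", True),
    ("Water boils at 212 degrees Fahrenheit at sea level.", True),
]
for statement, keyed_true in items:
    flag = "CHECK" if clue_agrees_with_key(statement, keyed_true) else "ok"
    print(f"{flag:<6}{statement}")
```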
ASSEMBLING THE TEST 


When the teacher or test-maker has constructed the test questions and 
recorded them, either on cards, as suggested in the preceding chapter, or in 
some other manner, he is ready to arrange the test items preparatory to 
having the test reproduced. In the case of an essay test, the sequence of 
questions presents no problem since such a test usually involves writing the 
questions on the board at the time of the examination. For tests employing 
questions of a more objective nature, however, the requirements are different. 
In the following section we shall discuss some of the problems relating 
to the organization of an objective test. 


Arranging the Questions 
A number of factors have a bearing on the arrangement of questions in an 
objective test. Among these are (a) difficulty, (b) content, (c) type of item, 
and (d) anticipated use of the scores. Ordinarily, in a standardized test the 
items are arranged in accordance with most or all of these criteria at the 
same time. For example, in nearly all such tests the items are arranged in 
order of difficulty, from the easiest to the most difficult. This system has 
two advantages. First, it encourages the person taking the test by starting 
him on items he can easily manage. Second, it tends to avoid the possibility 
of the student's getting stuck on a difficult item which might not leave him 
time enough to answer many easier ones that follow. Of course, the difficulty 
of individual test items is determined on the basis of responses by groups, 
and does not necessarily coincide with the item-difficulty pattern of any 
particular individual. The assumption which must be made is that the aver- 
age difficulty based on responses of the group is the best available difficulty 
index for the individuals in the group. 
In addition to the criterion of difficulty, it is customary to group items according 
to type (that is, true-false items in one group, multiple-choice 
in another, etc.) and then to arrange the items within each category according 
to difficulty, as just described. If the number of items of one kind 
is large, the items may, of course, be divided into two or more groups ac- 
cording to some other criterion such as content. For example, if there are 
fifty multiple-choice items and fifty true-false items to be arranged, each 
type might be subdivided into two groups of twenty-five each, particularly 
if there is some obvious and logical basis for such a division. If, in an 
arithmetic test, half of the true-false items deal with fractions and half with 
decimals, the items might be divided on that basis, and similarly the 
multiple-choice questions. In instances where part scores on these aspects 
of arithmetic are desired, such an arrangement will make it easier to obtain 
them. 

Basically then, in most objective tests questions are arranged according 
to type, and within the type groups, according to difficulty. The ordinary 
test-maker, as distinguished from the professional, usually has no basis for 
determining the difficulty of individual test items except his personal judg- 
ment. Whereas the producer of standardized tests tries out the items in 
preliminary forms to determine difficulty, the ordinary classroom teacher 
almost never has the opportunity to do this. Therefore, he does the best 
he can by grouping the items according to his own estimate of their 
difficulty. Parenthetically, a word might be added here concerning the 
grouping of items according to type. This practice is now almost univer- 
sal and the reasons are fairly obvious. In the first place, it facilitates scor- 
ing, since items all of the same type are easier to score than a mixture of 
various types. In the second place, a test in which items are grouped ac- 
cording to types of items is usually more agreeable to the examinee. 
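
The arrangement just described, by type first and then by difficulty within each type, can be sketched in a few lines of Python. This is only an illustration. The `Item` record, its field names, and the numerical difficulty estimates are assumptions introduced here; in practice the classroom teacher would supply his own judgment of difficulty.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    kind: str          # e.g. "true-false" or "multiple-choice"
    difficulty: float  # teacher's estimate on a hypothetical 0.0 (easy) to 1.0 (hard) scale


def arrange(items, type_order):
    """Group items by type in the given order, easiest first within each type."""
    rank = {kind: i for i, kind in enumerate(type_order)}
    return sorted(items, key=lambda it: (rank[it.kind], it.difficulty))


items = [
    Item("MC hard", "multiple-choice", 0.8),
    Item("TF hard", "true-false", 0.7),
    Item("MC easy", "multiple-choice", 0.2),
    Item("TF easy", "true-false", 0.1),
]
ordered = arrange(items, ["true-false", "multiple-choice"])
print([it.text for it in ordered])
```

A further sort key, such as content area, could be added to the tuple if the test is long enough to justify a third level of grouping.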

Arranging items within the type groups according to estimated difficulty 
cannot be a very exact process, and the test-maker will usually have to rely
upon his own experience and judgment in this matter. It is sometimes
quite satisfactory to arrange items according to length, placing the
shortest first. Length is never a basis for judging difficulty, but an arrangement
of this sort may encourage the examinee by leading him to suspect that the 
hardest questions will come at the end of the test. 

Items may be arranged according to content, as in the arithmetic case 
mentioned above. In every field of concern to him, the teacher has certain 
ideas or plans for organization of the subject matter. Usually, subject 
matter is organized by units and by areas within the units. For example, 
a unit on transportation may be organized according to historical periods 
or according to kinds of transportation — land, sea, and air, or mechanical 
and animal. Items in such divisions as these may in turn be arranged
according to type, difficulty, and content all at the same time, provided, of
course, that there are enough items. For short tests of twenty-five items or
less, grouping of the items on any basis except type is not often practical or 
useful. 

There is one other situation where grouping of items is important, namely, 
when diagnosis is the purpose of the test. Let us assume that in arithmetic, 
or in English, or in reading, the examiner wishes to identify specific strengths 
and weaknesses in the pupils’ grasp of fundamentals. The first step in con- 
structing a test for diagnostic purposes is to make a careful analysis of the 
rules and skills that are basic to progress in the subject. The next step is 
to construct an adequate number of questions on each rule or skill, and then 
arrange the questions in the test so that each group constitutes a measure of 
understanding of one of these. Thus, when the test results are available, the 
responses of each pupil, as well as of the entire class, to items testing a par- 
ticular rule or skill will be easily determined. Advance arrangement of 
items for such purposes will facilitate diagnosis and remedial instruction. 
This will be discussed and illustrated in more detail in Chapter 14.
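
As a rough sketch of how such grouped items support diagnosis, the following Python fragment tallies one pupil's responses by rule or skill. The mapping `skill_of_item` and the sample responses are hypothetical and serve only to show the bookkeeping.

```python
from collections import defaultdict

# Hypothetical mapping of item number to the rule or skill it tests.
skill_of_item = {1: "fractions", 2: "fractions", 3: "decimals", 4: "decimals"}


def skill_report(responses):
    """responses: item number -> True if answered correctly.
    Returns skill -> (number right, number of items on that skill)."""
    tally = defaultdict(lambda: [0, 0])
    for item, correct in responses.items():
        skill = skill_of_item[item]
        tally[skill][1] += 1
        if correct:
            tally[skill][0] += 1
    return {skill: tuple(counts) for skill, counts in tally.items()}


pupil = {1: True, 2: True, 3: False, 4: True}
report = skill_report(pupil)
print(report)
```

The same tally, summed over all pupils, would show which rules or skills give the class as a whole the most trouble.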


Learning Exercises


1. What are the main factors to be considered in arranging items in an objective 
test? Discuss each briefly, emphasizing the relationships between them. 

2. What are the advantages of grouping items according to type? Are there any 
advantages to an omnibus or spiral arrangement of items? (Note: In this arrange- 
ment items occur in cycles. For example, a true-false item is followed by a 
multiple-choice item and a short-answer item, and that cycle is repeated through- 
out the test or part.) 


Preparing Directions 

When the arrangement and grouping of the test items has been deter- 
mined, the next step is to prepare directions. Before the test is duplicated, 
directions should be prepared for the test as a whole, as well as for the sub- 
tests or parts. If the test is to be used as a semester- or year-end final 
examination, a title page may be used. This gives the test a more finished 
appearance; also, if pupils are not to start work on it until preliminary direc- 
tions have been given, a title page serves to cover the test proper until the 
directions have been read and the pupils told to begin. If there is to be a 
title page, it should be set up in a form similar to the example on page 142. 

If the examination is a cooperative effort and is given to the classes of 
different teachers, there should be space for the pupil to indicate his section 
and his teacher’s name so that the papers can be readily sorted. 

If a title page is not used, the essential information can be put at the top 
of the first page of the test. 


Figure 5
Typical Title Page for a Teacher-Made Test 


Final Examination 


English 10A 


DIRECTIONS: Do not turn the page until you are told 
to do so. The examination consists of two parts. 
Part 1 is True-False and Part 2 is Multiple-Choice. 
Directions are given in the test for each part. Please 
read them carefully and follow them exactly. You 
will have 40 minutes to work on the test. Try to 
answer every question, but if you do not know the 
answer to a question go on to the next and come back 
to the other one later. Do not skip around in the test. 
Begin at the beginning and work straight through. 
Your score on the test is the number right.


Directions for each part should come at the beginning of the part, as fol- 
lows: 


Directions: The questions in this part are either True or False. 
Read each one carefully. If you think it is true, place a + in 
the parentheses in front of the question; if you think it is false, 
place 0 in the parentheses. 


If the scoring is to be on the basis of Rights minus Wrongs, the following
should be added: 


If you are not sure, but can make an intelligent guess, answer 
the question; if not, omit it. Score on this part is Rights minus 
Wrongs.


If the scoring is to be Rights only, the following should be added:


Answer every question. The score is the number right. 


If locally duplicated answer sheets are used (see page 145), the directions
given first above should be modified in part as follows: 


If you think it is true, place a + on the answer sheet opposite
the number which corresponds to the number of the question. 
If you think it is false, place a 0, etc. 


If machine-scored or printed answer sheets are used, the directions should
be:


If you think it is true, make a heavy black mark (with the special
pencil) between the dotted lines in Column 1 opposite the
number which corresponds to the number of the question. If 
you think it is false, blacken the space between the pair of lines 
in Column 2. 


In similar fashion, directions for multiple-choice, matching, or short-answer
questions should be worked out and reproduced on the test paper so that 
the directions will precede the part to which they apply. 

With younger pupils especially, and in any case where pupils are not 
accustomed to objective-type tests, it is well to include a sample item with 
the directions. For example, the following could be used after the direc- 
tions for matching items: 


EXAMPLE: 
(3) paint 1. compound 
(1) water 2. element 
3. mixture 


Paint is a mixture, so the figure 3 has been placed in the paren- 
theses in front of paint; water is a compound, so the figure 1 has 
been placed in the parentheses before water. Mark the items
below in the same way. Notice that there are more items in the
right-hand column than in the left, so after all the blanks have
been filled you will have some on the right that you have not
used. Some of the items in the right-hand column may be used
more than once.


Directions for the test and the sub-tests or parts should be carefully 
worked out in advance and incorporated in the test. The ideal to strive 
for is to make the test as nearly self-administering as possible, so that the 
pupil understands what he is to do with a minimum of supplementary 
explanations from anyone. This is especially desirable if the test is to
be used by more than one teacher, and it is advantageous in any case
because a test which is nearly self-administering makes for uniformity and
objectivity of administration. Otherwise, there is a danger that supple- 
mentary instructions will not be identical when given to various pupils or 
groups at different times. 


Learning Exercise


3. Write directions for a set of matching questions. Submit them to the class for 
criticism. Are the directions clear and simple enough to be easily understood by 
pupils at the grade level for which they are intended? Can they be understood and 
followed without the aid of an example? 


Reproducing the Test 


In most cases, teacher-made tests are duplicated in the school office. 
Copy is prepared by the teacher, often in handwritten form. It is important 
to make clear to the typist such matters as capitalization, punctuation, 
spacing, and provisions for marking answers. Capitalization and
punctuation generally do not cause any difficulty, provided the test-maker knows
and indicates clearly what he wants. Words to be emphasized should be 
written in CAPITALS, or underlined, or both. 


The matter of spacing is important in setting up objective tests. All 
material on the page should be arranged and spaced in a manner which 
will make it as clear and legible as possible for the pupil. This procedure 
should be followed at all grade levels, and especially when preparing tests 
for younger children. If answers are to be indicated on the test itself, 
sufficient space should be provided and clearly indicated for that purpose. 
For example, when answers are to be marked within parentheses they 
should be spaced thus ( ), not thus (). If words or phrases are to be 
written, ample space for writing them should be provided. 

There should always be a full space between successive test items. It is 
poor economy to crowd them together; however, the statement of each 
item itself may be single-spaced. In the case of multiple-choice items, gen- 
eral practice favors setting them up in this manner: 


( ) The study of living things is called
    1. physics
    2. chemistry
    3. biology
    4. geology
    5. astronomy


The above method is generally preferred to the plan of listing choices thus: 


( ) The study of living things is called (1) physics (2) chemistry
(3) biology (4) geology (5) astronomy


Items should usually be numbered consecutively through the entire test 
rather than consecutively within each part. The latter procedure results 
in having two or more items with the same number, as two number 1's, two
number 2's, etc. If there are several parts, there will, of course, be an item
number 1 in each, and so on. This leads to confusion, particularly if 
locally made answer sheets are used. Most teacher-made answer sheets 
are set up with numbered spaces in columns, as shown below. 


Figure 6 


Sample of a Teacher-Made Answer Sheet


Pupils will become confused if the numbers on the items do not correspond
with the numbers on the answer sheet. When the numbering of items in 
different parts of the test begins in each case with number 1, no standard
answer sheet can be used; instead, each test must have its own answer sheet 
with numbers coinciding with those on the test. The consecutive arrange- 
ment shown in Figure 6 is nearly always followed with answer sheets accom- 
panying standardized tests. 

Any study or analysis of individual test items is also facilitated by the 
consecutive numbering system, since each number quickly identifies a
particular question and its corresponding answer, and distinguishes that
question and answer from all others on the test.

Tests may be reproduced by a number of different processes, each of 
which has some advantages. If more than one method is available, the 
choice will be determined by local needs and circumstances. Objective 
tests should always be duplicated. Although some attempts have been 
made to administer objective tests orally by reading the questions aloud,
such methods are not recommended. The pupil should be entitled to have
the test before him and should not be expected to keep the questions in
mind or make snap decisions in his choice of correct answers at the
moment the questions are read aloud.

Tests reproduced by processes available in most schools should be dupli- 
cated on one side of the sheet only. It is generally not satisfactory to use 
both sides because, unless extra-heavy paper — and extreme care — are 
used, some of the print will show through and make the material difficult to 
read. Also, when a test consists of several pages fastened together, it is 
easier for the pupil to handle if print appears on only one side of each page. 
This is especially true when tablet arm chairs are used and when answers 
are marked on the test itself rather than on a separate answer sheet.

For ease in scoring, the test should be set up and duplicated so that 
answer spaces appear in a straight line, all at the same margin. This is easily 
arranged for true-false, multiple-choice, and matching items by placing 
parentheses for the answer in front of the number of each item. With 
short-answer and completion questions a plan similar to the example below 
may be used: 


15. The surrender of (1) took place 1. 


Here, all answers are written in the numbered spaces at the right-hand 
margin. Also, since all the answer spaces are of the same length, there is 
no irrelevant clue as to the length of the correct answer. 

If answers are written on the test itself, enough copies of the test must be
made so that there will be a new one for each examinee every time the test
is administered. If separate answer sheets are used, the tests themselves 
can be used repeatedly until they are worn out. When the same test is used 
with several groups it is important to collect all copies each time it is used. 
If copies get into circulation the examination will obviously lose its useful- 
ness as a testing or measuring instrument. It is helpful to number all 
copies and require each pupil to write the number of the copy he is using 
on the answer sheet, provided separate answer sheets are used. In this way, 
missing copies of the test can be traced and perhaps recovered.

In arranging objective tests for duplication it is not desirable to have a 
question continue from one page to the next; starting a question on a new 
page is preferable to dividing it. This is particularly important with match- 
ing questions, for the examinee would become justifiably annoyed if he 
constantly had to turn a page back and forth to consult the two parts of 
each list on two separate pages. 


The Scoring Key 


When answers are to be marked on the test proper and the blanks are
spaced in the manner suggested above, preparation of a scoring key is simple.
The scoring key usually consists of a strip of heavy paper or cardboard about 
an inch and a half wide and as long as the answer column on the test page. 
The correct answers are written on this strip, spaced to match exactly the 
spacing of the items on the test page. Then the strip is laid alongside the 
column of answers and the scorer checks the wrong or the right answers, 
whichever he has decided to use in scoring. The answers on the key should 
be written near the edge of the strip so that when it is laid on the test the 
correct answers will be close to the answer spaces on the test.

When locally made answer sheets are used, the same type of scoring key 
is usable since the spaces on these answer sheets are usually arranged in 
columns, often four to the page, of about an inch to an inch and a
half wide. One hundred such spaces can easily be typed, double spaced,
on an ordinary sheet of paper, with 25 items in each of the four columns, as 
shown on page 145. 

When the machine-scorable type of answer sheet is scored by hand, the 
most convenient key is one made of a sheet of cardboard the same size as
the answer sheet. At each place where a correct answer mark should appear
on the answer sheet a small hole is punched in the cardboard. This makes a
stencil which can be laid over the answer sheet. Scoring is then done by
simply counting the number of marks that appear through the holes in the
stencil. When using a scoring stencil, it is necessary to scan the answer
sheet before covering it with the stencil to see that only one space is marked
for each item. Where more than one space is marked, such items should be
counted wrong unless, of course, there is more than one correct answer to
the items in question. 
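
The scoring rule just stated can be expressed as a short Python sketch. It assumes that each item has a single keyed answer, that an unmarked item is an omission, and that an item with more than one mark is counted wrong. The function name and the data layout are illustrative only. The optional Rights minus Wrongs score follows the directions given earlier for true-false items.

```python
def score_sheet(marks, key, rights_minus_wrongs=False):
    """marks: one set of marked choices per item; key: the keyed choice per item.
    Omitted items count neither right nor wrong; multiple marks count wrong."""
    rights = wrongs = 0
    for marked, correct in zip(marks, key):
        if not marked:
            continue  # omitted item
        if len(marked) == 1 and correct in marked:
            rights += 1
        else:
            wrongs += 1  # wrong choice, or more than one space marked
    return rights - wrongs if rights_minus_wrongs else rights


key = [1, 2, 1, 2, 1]
marks = [{1}, {2}, {1, 2}, set(), {2}]  # third item double-marked, fourth omitted
rights_only = score_sheet(marks, key)
corrected = score_sheet(marks, key, rights_minus_wrongs=True)
print(rights_only, corrected)
```

Checking for double marks inside the scoring function plays the part of scanning the answer sheet before the stencil is laid over it.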


ANALYZING THE RESULTS 


After a test has been tried out and scored the results may be analyzed 
in two ways. One is from the standpoint of what the results reveal about 
the pupils’ learning, or how successful instruction has been. This means 
far more than simply tabulating total scores on the test, since high scores 
may result from an easy test as well as from good teaching, and conversely, 
low scores may be caused by a difficult test or by inferior teaching. Thor- 
ough analysis of test results involves some attempt at diagnosis, even though 
the test may not have been set up with this purpose clearly in mind. It is 
always desirable to formulate some kind of analysis showing the degree of 
success with specific items and parts of the test in order to appraise the 
efficacy of teaching and learning. Unless an attempt at such analysis is 
made, the results of the test cannot be put to maximum use. 

Another type of analysis has for its purpose the evaluation of the test as 
a measuring instrument. How effective is the test and how well does it
function? Although careful analytical appraisal is an essential part of the
process of producing a standardized test, this is not generally practiced 
by classroom teachers on the tests they make. Yet some evaluation of this 
sort might well be a part of every teacher’s measurement program. Unless 
teachers are willing to “test the test,” the effectiveness of their measure- 
ment techniques cannot be satisfactorily determined. 

The rest of this chapter will be devoted to the explanation of a few simple 
techniques which any test-maker can use with a test of his own construction 
to appraise its worth as a measuring instrument.


Validity 

It will be remembered that validity refers to the degree to which a test
measures what it is intended to measure. How can a teacher appraise the 
validity of his own tests? Obviously, curricular validity is one measure. 
If the teacher has constructed the test on the basis of his instructional 
objectives, and if the test covers or measures what he has been teaching, 
the test may be said to have a degree of curricular validity; that is, it is 
valid because it tests what the teacher has been teaching. 

If the teacher wishes to go beyond curricular validity, he can compare the 
test scores with scores on other tests he has given, with marks based on class 
and laboratory work, and with other measures of achievement such as 
standardized tests in the same subject. Correlation is the usual method of
determining the extent of such relationships, and this is explained in
Chapter 3 and Appendix A.
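
For the reader who wishes to try such a comparison at once, the sketch below computes a product-moment correlation coefficient in Python. It assumes that the product-moment coefficient is the kind of correlation intended. The two score lists are invented for illustration and are not data from any actual class.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of five pupils on the teacher's test and on another measure.
teacher_test = [10, 20, 30, 40, 50]
other_measure = [12, 24, 33, 46, 55]
r = pearson_r(teacher_test, other_measure)
print(round(r, 3))
```

A coefficient near 1 would suggest that the two measures rank the pupils in much the same order. With so few pupils, however, any such figure should be read with caution.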

Another technique used by professional test-makers as a measure of valid- 
ity is the determination of the discrimination power of individual questions 
or items. In using this technique, the teacher usually must assume, first, 
that scores on the whole test have some validity and, second, that scores 
on a particular valid item should agree with scores on the whole test. Let 
us examine these assumptions. The first usually implies that the test has 
curricular validity. There may be other evidence of validity, but generally 
when item-discrimination is used as a means of establishing validity, the 
test, which is in this case the criterion of validity, is assumed to have curric- 
ular validity because it measures what has been taught. The second as- 
sumption is tested by comparing results on a given test item with scores 
on the whole test. Since the test is assumed to be valid, an item which 
agrees with the test is also valid. That is, an item which is answered
correctly by a higher proportion of those who make high scores on the test
than of those who do poorly on it is functioning in a manner consistent with 
the scores on the whole test. For example, let us assume that a test has 
been given and scored, and that the papers have been arranged in order of 
score, from highest to lowest. We may call the highest one-fourth of these 
the high group and the lowest one-fourth the low group. In a class of 40,
the highest 10 on the test would constitute the high group, and the lowest 
10 the low group. 

Now let us consider that the results of a hypothetical test item are as
follows:


ITEM No. 15
Answered correctly by
    High Group   7 (of the top 10)
    Low Group    3 (of the lowest 10)


We may conclude that this item discriminates as it should; that is, more of 
the pupils in the high group get it right than those in the low group. Such 
an item is said to have a positive discrimination value, or to discriminate 
positively. On the other hand, not every test item will discriminate
positively. For example:


ITEM No. 16
Answered correctly by
    High Group   4 (of the top 10)
    Low Group    7 (of the lowest 10)


This item is said to discriminate negatively because more pupils of the 
lower group get it right than pupils of the higher group. Items showing
negative discrimination are not uncommon, though it is the aim of the good
test technician to identify and eliminate them whenever he can. 

The usual explanation for negatively discriminating items is ambiguity 
in the statement of the item itself. In such items the abler, more thought- 
ful pupil may see some implications which the maker of the test himself 
overlooked, and may thus be led to choose an answer that has not been 
labelled as the correct one. The less able pupil, on the other hand, con- 
sidering the item on a more superficial basis, arrives at the answer which 
has been keyed as correct. 

Sometimes the maker of a test is inclined to feel that a question missed 
by a substantial proportion of his class is a bad item or that it reflects on his 
teaching ability. Consider the following typical results: 


ITEM No. 37
Answered correctly by
    High Group   30%
    Low Group    10%


This item is answered correctly by less than one-third of the high scoring 
pupils and by 10 per cent of the low scorers. Of the class as a whole, prob- 
ably about one-fifth would get it right. However, the item appears to be a 
good one since it discriminates clearly between the high and low groups in a 
positive direction. It is an item of more than average difficulty and is
therefore justifiable, since a test with adequate range of difficulty should include
items ranging from quite easy to fairly difficult. 

The teacher who makes a test for repeated use can carry out a simple 
item-analysis as explained here, and enter the results on the item card. 
Items which do not discriminate positively (that is, which are not
answered correctly by a larger proportion of the best pupils than of the less
able ones) should be studied for clues to the reason for the negative
discrimination. Unless negatively discriminating items can be shown to have
other important values, or can be made positively discriminating, there 
seems little reason to retain such items, for they will detract from the va- 
lidity of the test as a whole. 
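
The item-analysis procedure described in this section can be sketched in Python as follows. The function names and the layout of the data are assumptions made for illustration. The class of 40 papers is artificial and is built so that Items 15 and 16 reproduce the counts used in the examples above. The discrimination value is taken here as the per cent right in the high group minus the per cent right in the low group.

```python
def split_groups(papers):
    """papers: list of (total_score, set of item numbers answered correctly).
    Returns the highest and lowest fourths of the papers by total score."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    k = len(ranked) // 4
    if k == 0:
        raise ValueError("need at least four papers to form fourths")
    return ranked[:k], ranked[-k:]


def discrimination(item, high, low):
    """Per cent right in the high group minus per cent right in the low group."""
    pct_high = 100 * sum(item in right for _, right in high) / len(high)
    pct_low = 100 * sum(item in right for _, right in low) / len(low)
    return pct_high - pct_low


# Hypothetical class of 40 papers; rank 0 is the highest-scoring paper.
papers = []
for rank in range(40):
    right = set()
    if rank < 7 or rank >= 37:   # Item 15: 7 of the top 10, 3 of the lowest 10
        right.add(15)
    if rank < 4 or rank >= 33:   # Item 16: 4 of the top 10, 7 of the lowest 10
        right.add(16)
    papers.append((40 - rank, right))

high, low = split_groups(papers)
d15 = discrimination(15, high, low)
d16 = discrimination(16, high, low)
print(d15, d16)
```

A positive value marks an item that agrees with the test as a whole, and a negative value marks an item that deserves study. Papers tied at the boundary of a group are assigned arbitrarily in this sketch.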


Learning Exercise


4. If you can obtain copies of a test which has been given either to your class in 
measurement or elsewhere, and which has been scored, it will be interesting to make 
an analysis of some of the items. You should have at least ten papers in the top 
fourth of the class and ten in the lowest fourth. Set up a table such as the
following one and enter the data.


               TOP FOURTH              LOWEST FOURTH        DISC.
ITEM NO.   No. Right   % Right     No. Right   % Right     VALUE
   1
   2
   3
   4
   5
  etc.


Find the discrimination value by subtracting the percentage of the lowest fourth
getting the item right from the percentage for the top fourth. Negative values
generally denote poor items; zero differences denote non-discriminating items.


Difficulty 

Closely related to considerations of the discrimination values of test 
items is the question of difficulty. Except for special and unusual reasons, 
items which are answered correctly or which are missed by all pupils are 
not considered good, for such items make no discrimination whatever in the 
class or group being tested. A test made up entirely of such items would 
result in everyone's getting the same score, either perfect or zero. Obvi- 
ously, such a result tells nothing about differences in achievement among 
members of the class. Rather, it demonstrates that the test was either 
entirely too easy or too difficult, and thus reflects unfavorably on the test- 
maker's ability. 

The simplest way for the teacher to determine whether his test is appro- 
priate in difficulty for the group tested is to study the distribution of scores 
on the test. If the average is at or near the middle of the range and if there 
are no perfect or zero scores, the teacher may be fairly certain that the 
test is suitable for that group. For example, on a test containing 80 ques- 
tions it is found that the average score is 42 and the range of scores is from 
11 to 75. Such facts indicate that the test was suitable in range and dif- 
ficulty for this class. Let us suppose, however, that with another class or 
grade the same test shows an average score of 69 and a range from 50 to 
80; obviously the test was too easy for this group. If, on the other hand, 
the mean is 15 and the range from 0 to 40, the test would have been too 
difficult. 

Thus it may be inferred that difficulty has a bearing on the discrimination 
value of a test — that a test which is too hard or too easy will not discrimi- 
nate between individuals of different levels of achievement as well as one 
which is more appropriate for the range of abilities in the group. 

Though absolutely fixed points for acceptable averages or range can 
rarely be established, the following general principles may serve as a useful 
guide to the student: 




1. Items which are missed or answered correctly by every pupil are
not discriminating in that group.

2. On a test which is appropriate in difficulty for a given group,
the average score should be near the middle of the range of possible
scores.

3. On a test suitable for a group of the usual variability, the range of
scores should be as wide as possible.

4. Tests which give zero scores or perfect scores are not discriminating
for the individuals who make such scores.
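
These principles can be checked quickly from a list of scores. The Python sketch below merely reports the quantities the principles refer to. It draws no verdict of its own, since no fixed cut-off points are given here. The function name and the sample scores are hypothetical.

```python
def summarize(scores, n_items):
    """Report the mean, the middle of the possible range, the observed range,
    and whether any zero or perfect scores occurred."""
    return {
        "mean": sum(scores) / len(scores),
        "midpoint": n_items / 2,
        "range": (min(scores), max(scores)),
        "has_zero": min(scores) == 0,
        "has_perfect": max(scores) == n_items,
    }


# Invented scores on an 80-item test, for illustration only.
suitable = summarize([11, 30, 42, 50, 75], 80)
too_easy = summarize([50, 69, 77, 80], 80)
print(suitable)
print(too_easy)
```

In the first case the mean lies close to the midpoint of 40 and there are no zero or perfect scores. In the second the mean lies far above the midpoint and a perfect score appears.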


Learning Exercise


5. In Exercise No. 4 on page 150 you found discrimination values for certain test 
items. Find the difficulty values for these items by averaging the per cent right in 
the top fourth and the per cent right in the bottom fourth. Do you find a satis- 
factory range of difficulty from quite easy to rather difficult? 

Find the average and the range of scores on the test. Do they indicate that the 
test was suitable for the group, too easy, or too difficult? 
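
The averaging called for in this exercise is a one-line calculation. As a minimal sketch (the function name is an assumption), it is applied here to the figures given earlier for Items 15 and 37.

```python
def difficulty_value(pct_right_top, pct_right_bottom):
    """Average of the per cent right in the top and bottom fourths."""
    return (pct_right_top + pct_right_bottom) / 2


item_15 = difficulty_value(70, 30)   # 7 of 10 and 3 of 10
item_37 = difficulty_value(30, 10)
print(item_15, item_37)
```

The value of 20 for Item 37 agrees with the earlier remark that about one-fifth of the class would probably get that item right.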


Reliability 

Another characteristic of importance to the test-maker is reliability. 
This is a measure of the consistency with which a test measures whatever it 
is intended to measure. Various methods for determining reliability (the
test-retest, equivalent forms, and the split-half methods) are described
in Chapter 3 and in Appendix A. However, test reliability may be
estimated by a shorter method which is accurate enough for classroom tests and
other ordinary situations. It requires the calculation of only two measures, 
the mean and the standard deviation of the distribution of scores on the 
test. The formula follows:¹

$$
r_{tt} = \frac{n\sigma_t^2 - M(n - M)}{\sigma_t^2\,(n - 1)}
$$

where $r_{tt}$ = reliability of the test, $n$ = number of items in the test,
$\sigma_t$ = standard deviation of the scores on the test, and $M$ = mean of
scores on the test.

An example will help to clarify the use of this formula. Suppose that
a teacher has given a test of 50 items and has found the mean to be 30 
and the standard deviation 6. Though circumstances make it inconvenient
for him to use any of the usual methods of determining test reliability,
he must quickly ascertain the reliability of this test. To do this, he uses the
formula given above, substituting these values: n = 50, M = 30, and
$\sigma_t = 6$.

$$
\begin{aligned}
r_{tt} &= \frac{[50 \times (6)^2] - [30\,(50 - 30)]}{(6)^2\,(50 - 1)} \\
       &= \frac{[50 \times 36] - [30 \times 20]}{36 \times 49} \\
       &= \frac{1800 - 600}{1764} \\
       &= \frac{1200}{1764} \\
       &= .68
\end{aligned}
$$

¹ G. Frederic Kuder and M. W. Richardson, “The Theory of the Estimation of Test
Reliability,” Psychometrika, 2:151-60 (September, 1937).


Thus the teacher finds that his test has a reliability of .68, which is not very 
high. In fact, it is quite low for individual measurement and prediction. 
However, he can take some comfort from the knowledge that this formula 
nearly always gives an underestimate of true reliability, and that his test 
therefore probably has a reliability at least as high as .68. Knowing the 
reliability of his test enables the teacher to determine how much reliance 
he can place on the scores yielded by it. 
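
The same calculation can be carried out in Python. The sketch below implements the formula given above, which is commonly known as Kuder-Richardson formula 21. The function names are illustrative. The helper that works from raw scores assumes the standard deviation is computed over the whole group (the population form), a choice the text does not specify.

```python
from statistics import mean, pstdev


def kr21(n_items, test_mean, sd):
    """Reliability estimate from the number of items, the mean, and the
    standard deviation of the scores."""
    variance = sd ** 2
    return (n_items * variance - test_mean * (n_items - test_mean)) / (
        variance * (n_items - 1)
    )


def kr21_from_scores(scores, n_items):
    """Same estimate, computing the mean and standard deviation from raw scores.
    Assumption: the population standard deviation is used."""
    return kr21(n_items, mean(scores), pstdev(scores))


r = kr21(50, 30, 6)   # the worked example: 50 items, mean 30, S.D. 6
print(round(r, 2))
```

Rounded to two places, the result agrees with the value of .68 obtained by hand.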


Learning Exercise


6. Obtain the results of a standardized test given to a class or group of pupils and 
calculate the reliability coefficient by the method shown above. In the manual that
accompanies the test you should find a reliability coefficient given. How does the 
value you obtained compare with that given in the manual? Is the difference sig- 
nificant? How do you account for any difference found? 


There are many refinements and technical details of test analysis which, 
because they are beyond the scope of the usual first course in educational 
measurement, are not mentioned here. The average teacher or counselor, 
however, should be able to follow the suggestions and procedures presented 
here; certainly if he does use such procedures he will achieve steady improve- 
ment in the construction of his tests and in his whole measurement program, 
and will thus experience a sense of genuine satisfaction in this important
part of his job.


Measuring Achievement in the Elementary Grades


In this chapter and the following one we shall discuss the measurement of
achievement in elementary and secondary grades and describe the types of
instruments most commonly used. Since there are hundreds of standard- 
ized achievement tests available, it is obviously beyond the scope of this 
book to attempt to describe all or even a substantial number of them. 
Nevertheless, we shall describe in some detail examples of each of the im- 
portant types, and list other major tests which are available in the same 
areas. No claim is made here that the tests or other measures chosen for 
detailed description in these chapters are necessarily the best of their
respective types. Rather, an attempt has been made to select representative
or typical instruments and to avoid emphasizing the product of any single 
publisher. Unquestionably, in many cases other equally good and equally 
representative samples might be chosen for illustration. Above all, it is 
the writer's intention to make his descriptions of the tests as instructive 
as possible by the selection of examples varying widely in organization, 
approach, format, techniques, methods of scoring and interpretation, etc.
It is hoped that those tests described, taken together, will constitute a 
reasonably representative sample of the better achievement tests available 
today. 

A check-list or schedule is followed in the test descriptions in this chapter 
and the next so that the descriptions will be systematic and comparable 
from test to test. The items or categories in the check-list are as follows: 


1. Names of test and author(s)

2. Nature and purpose of test

3. Grade level

4. Number of forms

5. Publisher and date of publication

6. Cost¹

7. Content: source, nature, types of sub-tests, types of items

8. Time required to administer

9. Directions for, and ease of, administering

10. Validity: nature of data

11. Reliability: nature of data

12. Manual: nature of, adequacy, clarity, simplicity

13. Scoring: methods, ease, objectivity, scores

14. Norms: type, adequacy, usability

15. Format: paper, printing, arrangement


In addition to the information presented about a particular test in the
check-list, comments of a general nature will be made where it seems appro- 
priate and wherever such comments will be helpful in giving the student a 
better understanding of the test being discussed. 


SURVEY BATTERIES 


Survey batteries have long been used for a number of purposes. When an 
over-all measure of achievement in the common branches or subjects of 
instruction is needed for purposes of grade placement, promotion, or group- 
ing, the survey battery is most often used. It is also useful in comparisons 
among classes, schools, or school systems, among individual pupils, or in 
comparisons of the individual pupil with norms for his age or grade. The
survey battery may be employed to good advantage in analyzing and com- 
paring a pupil's achievement in the different subject-matter areas, thereby 
revealing his strengths and deficiencies. For example, a pupil may be at or 
above the grade norms in certain areas such as reading and social studies, 
but below the norms in certain others such as arithmetic. The survey bat- 
tery is a convenient instrument for revealing such differences, and the defi- 
cient areas may be followed up with diagnostic tests to identify specific 
weaknesses. Survey batteries have the further advantage that the norms
for the sub-tests are based on the same population sample and are therefore
directly comparable. Generally speaking, the norms will not be comparable
where separate tests are used in different subjects.

1 As we have pointed out in Chapter 4, cost is not the most important factor to be
considered in choosing a test. However, it is generally of interest to all test users. For
this reason, we have included prices in the listings and detailed descriptions of the tests.
These prices will provide a basis of comparison of different tests on a factor which may
be decisive, other things being equal. They will also make it possible for the prospective
user to obtain some idea of the total cost of materials for a measurement program. The
prices listed are taken from the latest available catalogs of test publishers at the time of
publication of this book. They are subject to change, of course, and do not, as a rule,
include the cost of separate answer sheets, although we have listed the prices of separate
answer sheets where their use is required. Test prices do not include the cost of trans-
portation from publisher to purchaser. The reader is strongly urged not to order tests on
the basis of price information given in this book; instead, he should use the latest cata-
logs of the test publishers.

Survey batteries have found widest use in the elementary grades where 
much uniformity of objectives and content in fundamental areas is to be 
found. Because the survey battery is usually designed to give an appraisal 
of achievement in all of the commonly taught subjects, and since the testing 
time is usually fairly short, the measurement in each subject is necessarily 
limited, both as to scope and thoroughness. Consequently, survey batteries 
have been criticized on the basis of superficiality and narrow coverage of 
educational objectives. The makers of these tests have generally had to 
confine the scope of the test items to the knowledge and skills which are 
common to virtually all textbooks and courses of study in a particular sub- 
ject for a given grade level in order to meet the practical limitations on 
time for the use of such tests. In spite of these restrictions, many survey 
batteries have found extensive use over a long period of time, which indicates
that the tests have met the needs of many teachers and administrators.


Stanford Achievement Test 


The first of its kind to be published, this test appeared in 1923. A second 
edition was published in 1929, a completely revised edition in 1940, and the 
present edition in 1953. It has held a position of leadership in the field for 
thirty years, and is still probably one of the best-known and most widely 
used survey batteries in existence. 


1. Names of test and authors. Stanford Achievement Test.² Authors:
Truman L. Kelley, Harvard University; Richard Madden, San Diego State 
College; Eric Gardner, Syracuse University; Lewis M. Terman, Stanford Uni- 
versity; and Giles M. Ruch. 

2. Nature and purpose. Survey battery of achievement tests in common 
branches of the elementary curriculum. To provide dependable measures of 
knowledge, skills, and understanding commonly accepted as desirable out- 
comes. 

3. Grade level. End of Grade 1 through Grade 9. Primary Battery for 
end of Grade 1 through first half of Grade 3; Elementary Battery for Grades 3 
and 4; Intermediate Battery, Grades 5 and 6; Advanced Battery for Grades 7, 
8, and 9. 

4. Number of forms. J, K, L, M, and N for each Complete Battery.
Forms Jm, Km, and Lm for Partial Batteries. 

5. Publisher and date of publication. World Book Company, 1953.

6. Cost. Primary Battery (complete), $2.75 for 35 copies. Elementary 
Battery (complete), $3.25 for 35 copies. Intermediate Battery (complete), 
$5.40 for 35 copies. Advanced Battery (complete), $5.40 for 35 copies.
Intermediate Battery (partial), $5.95 for 35 copies. Advanced Battery (partial),
$5.95 for 35 copies.

2 Quotations in test description by permission of the publisher.

7. Content. The Primary Battery includes in a single 8-page booklet tests 
of Paragraph Meaning, Word Meaning, Spelling, Arithmetic Reasoning, and 
Arithmetic Computation. 

The Elementary Battery includes in a single 12-page booklet tests of Para- 
graph Meaning, Word Meaning, Spelling, Language, Arithmetic Reasoning, 
and Arithmetic Computation. 

The Intermediate and the Advanced Batteries each include in a 24-page 
booklet tests of Paragraph Meaning, Word Meaning, Spelling, Language, 
Arithmetic Reasoning, Arithmetic Computation, Social Studies, Science, and 
Study Skills. 

The Intermediate and Advanced Batteries are also available in Partial Bat- 
teries, each of which includes in a single 16-page booklet tests of Paragraph 
Meaning, Word Meaning, Spelling, Language, Arithmetic Reasoning, and 
Arithmetic Computation. 

The Paragraph Meaning and Word Meaning Tests found in all batteries are 
essentially reading tests. An example from the Paragraph Meaning Test in 
the Primary Battery is: 


Helen was sick. The girls at school wrote her a letter. “Dear
Helen,” they said, “We hope you will soon feel __16__
enough to come back to __17__.”


16. well happy nice glad
17. church visit school town


In the Primary and the Elementary Batteries, the pupil draws a line under 
the one word that belongs in the space. The Intermediate and the Advanced 
Batteries are set up for machine scoring or hand scoring, so answers are indicated 
in the usual manner by blackening spaces on an answer sheet with a soft lead 
pencil.

The Word Meaning Test is a vocabulary test in which the pupil selects a word
that makes a sentence true, as:


One of the seasons is ______
year night sunshine winter


The Spelling Test in the Primary and Elementary Batteries consists of a list 
of words which are dictated and which the pupil writes down. In the Inter- 
mediate and Advanced Batteries the test takes the following form: 


           4 potatos
18. Baked  5 potatoes  are good.     4  5  6  NG
           6 potatose


(The “NG” means the correct spelling is not given.)
The Arithmetic Reasoning Tests are made up of questions like the following: 


8. There were 9 children playing. Then 3 went home. How
many were left? ______

11. Tom runs errands for 15 cents each. If he averages 15 errands
a month, what is his monthly income?
a. 10 cents b. 30 cents c. $1.50 d. $2.25 e. not given


In the Primary Battery, the Arithmetic Reasoning Test includes 13 problems 
involving pictures. For example: 


2. Now look at the next row of pictures. Put your finger on the 
tree. See the birds? Put a cross on the eighth bird. 


Arithmetic Computation consists of problems in fundamental operations 
such as: 


(20)   94
      × 2

or (18) 69424     f 84   g 94   h 904   i 940   j not given

or (42) [item illegible in source]     g 16   i 32   j not given


The Elementary Battery also includes a test of Language. It includes items 
on capitalization, punctuation, and sentence sense. Examples given below from 
the Intermediate and the Advanced Batteries will illustrate the types of ques- 
tions found in the Language Test in the Elementary Battery, also. The Inter- 
mediate and the Advanced Batteries include, besides tests in Language, tests 
in Social Studies, Science, and Study Skills. 

In the Intermediate Battery the Language Test includes capitalization and 
punctuation, grammar, and sentence sense. In the Advanced Battery the 
same objectives are tested, though in slightly different ways. All the items 
are of the alternate response (two-choice) type. For example: 


A Birthday Party


1. My ... is having a birthday party

2. Can you come? We will have lunch and

3. listen to his new record, ... treasure island ...


Again:

7. Nancy can certainly read ...
8. The dog is looking for ... master


Good sentence:

3. Oak trees grew on the hill above
4. Oak trees growing on the hill above


The Social Studies Tests include questions on history, geography, and civics 
of the four-response, multiple-choice, best-answer type: 


15. At an election, people
1 sell things 2 buy goods 3 pay fines
4 choose leaders


16. The largest country in South America is
1 Chile 2 Argentina 3 Peru 4 Brazil


12. The United States purchased Alaska
from
5 Japan 6 Russia 7 France 8 Spain


The Science Tests include questions on life science, chemistry, physics, earth 
science, conservation, and health and safety. Some examples are: 


14. An evergreen tree is the
5 walnut 6 pine 7 peach 8 maple


15. In the United States we have the most
hours of daylight in
1 June 2 September 3 December
4 March


The last tests in the Intermediate and the Advanced Batteries are those 
dealing with Study Skills. Included are questions on reading charts, tables, 
and maps, and using the dictionary and other sources. A few examples follow: 


Use the line graph below in answering Question 6. 


[Line graph: Number of Papers Joe Sold on Six Days. Vertical axis: number of papers, 20 to 80; horizontal axis: Mon., Tues., Wed., Thurs., Fri., Sat.]


6. How many papers did Joe sell on Wednesday?
e 25   f 30   g 35   h 40


8. The most direct reference to the Inca Indians
of Peru, South America, would be
found in an index under
5 Inca Indians 6 South America
7 Peru 8 Indians


In general, content of test items has been carefully chosen on the basis of
word counts, analysis of textbooks and courses of study, and consultation with 
experts in the respective fields or areas. 


8. Time required to administer. The total time for each battery and
recommended sittings or divisions are as follows:


Battery                 Sittings (minutes)                       Total
Primary                 Three: 32, 28, and 35                    1 hr. 35 min.
Elementary              Five: 32, 31, 28, 33, and 28             2 hr. 32 min.
Intermediate            Six: 44, 36, 38, 38, 40, and 43          3 hr. 59 min.
Advanced                Six: same as Intermediate                3 hr. 59 min.
Intermediate (Partial)  Three: 37, 39, and 48                    2 hr. 4 min.
Advanced (Partial)      Three: same as Intermediate (Partial)    2 hr. 4 min.


Each test in each battery may be administered at a separate sitting if desired, 
or the tests may be grouped into fewer but longer sittings as appropriate for the 
age and maturity of the pupils being tested. The tests are not speeded and time 
limits are considered ample for most classes. 
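
A teacher planning a testing schedule may wish to check that the sitting lengths listed under item 8 add up to the stated totals. The short sketch below does this; the dictionary name and the helper function are illustrative only and are not part of the test manual.

```python
# Sitting lengths in minutes, copied from item 8 (Time required to administer).
SITTINGS = {
    "Primary": [32, 28, 35],
    "Elementary": [32, 31, 28, 33, 28],
    "Intermediate": [44, 36, 38, 38, 40, 43],
    "Intermediate (Partial)": [37, 39, 48],
}


def total_time(minutes):
    """Return (hours, minutes) for a list of sitting lengths in minutes."""
    return divmod(sum(minutes), 60)


for battery, sittings in SITTINGS.items():
    hours, mins = total_time(sittings)
    print(f"{battery}: {hours} hr. {mins} min.")
```

The Advanced batteries use the same sittings as the corresponding Intermediate batteries, so they are not listed separately.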


9. Directions for, and ease of, administering. The directions are clear
and well set up for teachers or for others without special training or experience 
in test administration. Directions to pupils include sample items worked out, 
and practice items which are answered with assistance from the person admin- 
istering the test, if necessary, before the test proper is begun. 


10. Validity. The evidence on validity of the Stanford Achievement Test is
entirely curricular in nature. The manual states that “A major goal in the
preparation of this edition of the Stanford was to insure that the content of the
test would be in harmony with present objectives and measure what is actually
being taught in today’s schools. To make certain that the test content would
be valid in this sense, the construction of the new edition . . . was preceded by
a thorough analysis of the most widely used series of elementary textbooks in
the various subjects, of a wide variety of courses of study, and of the research
literature pertaining to children’s concepts, experiences, and vocabulary at suc-
cessive ages or grades.” The tests were constructed according to outlines based
on these analyses. Subject-matter specialists were consulted at each step in the
development of the tests.

The authors of the battery believe that the Stanford measures a substantial
sampling of the important outcomes or goals of instruction in the common 
branches at the elementary level. It would have been helpful if at least some 
of the sources used had been specifically named. 


11. Reliability. All reliability data are in the form of split-half reliability 
coefficients and standard errors of measurement. These are given for each 
sub-test of every battery by grades. Thus the reliability coefficient for the test 
on Paragraph Meaning in the Intermediate Battery for fifth grade is .886, and
the standard error of measurement is 3.06. Generalizing these data, it may be
said that the 67 reliability coefficients range between .820 and .956, with the
exception of four: Word Meaning, Grade 1, is .754; Arithmetic Reasoning,
Grade 1, is .664; Arithmetic Reasoning, Grade 2, is .788; and Language, Grade 5,
is .795.


The standard errors of measurement, with a few exceptions, are in the neigh- 
borhood of 2 to 3 grade score points. (See No. 13 below for an explanation of 
grade scores.) 

On the whole, the reliabilities, except those noted, are quite satisfactory, 
especially in view of the fact that each is calculated on the restricted range of a 
single grade.³ The reliability of the batteries as a whole would certainly be
higher. Each reliability coefficient is based on approximately 250 cases in a 
given grade drawn as a random sample from pupils in 34 school systems. 
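
The relation between a reliability coefficient and a standard error of measurement can be illustrated with the conventional formulas. The sketch below assumes the usual Spearman-Brown step-up for a split-half correlation and the usual definition SEM = SD × √(1 − r); the manual, as described here, does not state which formulas were used, and the function names are illustrative. Applied to the fifth-grade Paragraph Meaning figures (r = .886, SEM = 3.06), the formula implies a within-grade standard deviation of roughly 9 grade score points.

```python
import math


def spearman_brown(r_half):
    """Step a correlation between two half-tests up to full-length reliability."""
    return 2 * r_half / (1 + r_half)


def standard_error_of_measurement(sd, reliability):
    """Conventional SEM: standard deviation times sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def implied_sd(sem, reliability):
    """Invert the SEM formula to recover the standard deviation it implies."""
    return sem / math.sqrt(1 - reliability)


# Figures quoted for Paragraph Meaning, Intermediate Battery, Grade 5.
sd = implied_sd(3.06, 0.886)
print(round(sd, 1))
```

Read the other way, a pupil's obtained grade score on this sub-test would be expected to fall within about 3 points of his true score roughly two times in three, which is the ordinary interpretation of a standard error.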


12. Manual. The manuals for the Stanford Achievement Test are clear and 
complete. There are separate manuals for the Primary and the Elementary 
Batteries, and one for the Intermediate and Advanced Batteries. Certain por- 
tions of all manuals are the same, such as those dealing with the construction, 
purposes, interpretation of scores, and suggestions for use of the tests. Differ- 
entiation is found wherever necessary for a particular battery. 


13. Scoring. All Complete Batteries are scored manually. The Interme- 
diate Partial and Advanced Partial Batteries are adapted for use with separate 
answer sheets which may be scored by machine or by hand. The scoring of the 
lower level batteries is less objective since most of the responses are indicated by 
underlining, or by writing, as in the case of the Spelling Test. The directions 
for scoring are easily followed and entirely adequate. The key is printed on 
heavy cardboard which may be cut into strips for convenience in scoring. 

The title page of every copy of a battery carries a box for recording scores on 
the separate tests, and an individual profile chart for plotting a pupil's perform- 
ance on each sub-test and comparing it with his grade norms. 

The basic score unit in all tests except Language is the number right. In 
this one, the score is the Rights minus the Wrongs because all items in this test 
are of the two-choice or alternate-response type. The raw score is converted 
by means of a table at the end of each test to a “grade score” which represents
the grade level for a given raw score. Whether this is the average grade level 
for all pupils making that raw score or the average score for pupils at a given 
grade level is not made clear, but it is probably the latter. Grade scores are 
expressed in whole numbers. Thus 92 means the second month of the ninth 
grade. 
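
The two scoring conventions just described can be sketched briefly. The general correction formula R − W/(k − 1) used below is the conventional one and is an assumption here; the source states only the two-choice case, where it reduces to Rights minus Wrongs. The function names are illustrative, and the raw-score-to-grade-score tables themselves are not reproduced.

```python
from typing import Tuple


def corrected_score(rights: int, wrongs: int, n_choices: int = 2) -> float:
    """Rights minus wrongs/(n_choices - 1); equals R - W for two-choice items."""
    if n_choices < 2:
        raise ValueError("an item needs at least two choices")
    return rights - wrongs / (n_choices - 1)


def split_grade_score(grade_score: int) -> Tuple[int, int]:
    """Split a whole-number grade score into (grade, month)."""
    return divmod(grade_score, 10)


# A pupil with 40 right and 10 wrong on the two-choice Language Test.
print(corrected_score(40, 10))

# The grade score 92 from the text: second month of the ninth grade.
print(split_grade_score(92))
```

Omitted items count neither as right nor as wrong under this rule, so a pupil who guesses blindly on two-choice items gains nothing on the average.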


14. Norms. Three types of norms are used, namely, grade, age, and per- 
centile. The grade norms are further differentiated into modal age and total 
group norms. The modal age norms are based only on the performance of those 
pupils who are “at age” for their grade. For example, modal age norms for fifth 
grade are based on children who are between the ages of 9-6 and 10-6, or those 
who entered first grade at approximately the same age (5-6) and have spent one 
year in each grade. Basing norms on modal age groups eliminates retarded and
accelerated pupils from consideration. Since there are nearly always more re-
tarded than accelerated pupils in a given grade, the modal age norms tend to be
higher than those of the total group. The differences are negligible in Grades 2
and 3, but average as much as half a year in the upper grades. Grade scores are
also convertible into percentile norms.

3 It is important to know the within-grade reliabilities since tests of this kind are
widely used for differentiation among pupils within a given grade.

A fourth type of norm is the so-called K-score. This purports to express 
scores in units that are approximately equal at all points on the scale and thus to 
provide a better method of measuring growth. 

Scores on the 1953 Edition may also be equated with earlier editions by use 
of charts or tables obtainable from the publisher. 
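
The modal age rule described under Norms can be written out as a small sketch. It generalizes from the one example given (fifth grade, ages 9-6 to 10-6, with entry to first grade at about 5-6) and assumes one year per grade; treating the lower bound as included and the upper bound as excluded is also an assumption, since the source says only "between." The function names are illustrative.

```python
def modal_age_range(grade, entry_age_months=66):
    """Return (low, high) in months for pupils 'at age' at the start of a grade.

    Assumes entry to first grade at 5-6 (66 months) and one year per grade.
    """
    low = entry_age_months + 12 * (grade - 1)
    return low, low + 12


def is_at_age(years, months, grade):
    """True if an age given as years-months falls in the modal range for the grade."""
    low, high = modal_age_range(grade)
    return low <= years * 12 + months < high


print(modal_age_range(5))   # fifth grade, in months
print(is_at_age(9, 6, 5))   # a pupil aged 9-6 in fifth grade
print(is_at_age(11, 0, 5))  # an over-age pupil in fifth grade
```

Pupils for whom the check fails are the retarded or accelerated pupils who are left out of the modal age norms but included in the total group norms.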


15. Format. Paper, printing, and general organization of the tests and 
accessories are excellent. In addition to the accessories already mentioned, 
there is a Class Record and a Class Analysis Chart which aid the user in de- 
termining what the results of the testing reveal about his classes and which 
provide a convenient permanent record for his files. The title page of each test 
booklet carrying the individual pupil's scores and profile may be removed and 
filed in his cumulative record or other folder. 


Learning Exercises


1. What are some of the chief uses of survey batteries in the elementary schools? 
What are their limitations? 

2. Select an elementary survey battery for study. Read reviews of it in one of 
the Mental Measurements Yearbooks, but be sure to examine a specimen set. Write 
a 500-word appraisal of it. 


Other Survey Batteries 

1. American School Achievement Tests. 1955. Forms D and E, all levels. 

Primary I, Grade 1. Word Recognition, Word Meaning, Numbers. $2.00 
per 25. 35 (50) minutes.⁴

Primary II, Grades 2-3. Sentence and Word Meaning, Paragraph Mean- 
ing, Arithmetic Computation, Arithmetic Problems, Language and Spelling. 
$2.50 per 25. 85 (105) minutes. 

Intermediate Battery Complete. Grades 4-6. Reading, Arithmetic, Language,
Spelling, Social Studies, Science. $4.50 per 25. 127 (147) minutes.

Advanced Battery Complete. Grades 7-9. Reading, Arithmetic, Language, 
Spelling, Social Studies, Science. $4.50 per 25. 147 (170) minutes. 

Intermediate Battery Partial. Grades 4-6. Reading, Arithmetic, Lan- 
guage, Spelling. $3.45 per 25. 117 (137) minutes. 

Advanced Battery Partial. Grades 7-9. Reading, Arithmetic, Language, 
Spelling. $3.45 per 25. 137 (157) minutes. 

All answers are marked on test booklets and scored with a carbon-back device. 

Public School Publishing Company.


4 Wherever possible, two times are given for every test; the first is the actual working 
time, and the second is the estimated total time required to administer the test. Where 
only one time is given it is the actual working time without allowance for reading
directions, etc.


2. California (formerly Progressive) Achievement Tests. 1950. Primary,
Grades 1-3, Low 4. Reading Vocabulary, Reading Comprehension, Arithmetic 
Reasoning, Arithmetic Fundamentals, Mechanics of English and Grammar, 
Spelling. Forms AA, BB, CC, and DD. $.13 per copy, or in packages of 35. 
90 (110) minutes. 

Elementary, Grades 4-6. Areas same as above. Forms same as above. 
$.14 per copy in any quantity, or packages of 35. 120 (135) minutes. 

Intermediate, Grades 7-9. Areas, forms, and price same as above. 

Advanced, Grades 9-14. Areas and price same as above. Forms AA, BB, 
and CC. 

California Test Bureau. 


3. Coordinated Scales of Attainment. 1948-50. One battery for each grade, 
1-8. Battery 1, Picture-Word Association, Word-Picture Association, Vocab- 
ulary Recognition, Reading Comprehension, Arithmetic Experience, Number 
Skills, Problem Reasoning, Computation. Forms A, B. $1.95 per 25. 

Battery 2, same as Battery 1 plus Spelling. Forms A, B. $1.95 per 25. 

Battery 3, same as Battery 2. Forms A, B. $1.95 per 25. 

Battery 4, Punctuation, Usage, Capitalization, Reading, History, Geography, 
Science, Literature, Computation, Problem Reasoning, Spelling. Forms A, B. 
$3.20 per 25. IBM scorable answer booklets, $1.95 per 25. $1.60 per 25 
manually-scorable answer booklets. 

Batteries 5, 6, 7, and 8, same areas as Battery 4. Forms A, B. Prices same 
as for Battery 4.

Testing time: Batteries 1, 2, and 3, 90 minutes each; Batteries 4-8, 4 hours,
16 minutes each.

Educational Test Bureau. 


4. Iowa Every-Pupil Test of Basic Skills, New Edition. 1940-50. Ele- 
mentary, Grades 3-5; Advanced, Grades 5-9. Forms L, M, N, O, each con- 
sisting of Test A, Silent Reading Comprehension; Test B, Work-Study Skills; 
Test C, Basic Language Skills; Test D, Basic Arithmetic Skills. Elementary 
Battery, $10.80 per 35, separate tests, $3.00 per 35; Advanced Battery, $11.70 
per 35, separate tests, $3.15 per 35; specimen sets of either battery available. 

Houghton Mifflin Company.


5. Iowa Tests of Basic Skills. 1956. Multi-level edition for Grades 3-9.
Vocabulary, Reading Comprehension, Language Skills, Work-Study Skills and 
Arithmetic Skills. Forms 1, 2. $.60 per copy. Special MRC answer sheets, 
$3.00 per 35 copies plus 35 pupil report folders, and copy of teacher's manual, 
class record sheets and grade percentile norms. Regular IBM answer sheets, 
$1.35 per 35. 4 hours, 39 minutes. 

Houghton Mifflin Company. 


6. Metropolitan Achievement Tests. 1946. Primary I, Grade 1 and begin- 
ning of Grade 2. Word and Phrase Recognition, Word Meaning, Numbers. 
Forms R, S, T. $2.95 per 35. 45 (60) minutes. Specimen set, $.50. 

Primary II, Grade 2 and beginning Grade 3. Reading, Vocabulary, Arith- 
metic Fundamentals, Arithmetic Problems, Spelling. Forms R, S, T. $3.60 
per 35. 85 (100) minutes. 

Elementary, Grades 3 and 4 and beginning of Grade 5. Reading, Vocabulary,
Arithmetic Fundamentals, Arithmetic Problems, Language Usage, Spelling.
Forms R, S, T, U. $4.45 per 35. 135 (150) minutes.

Intermediate, Grades 5-7.5. Complete: Reading, Vocabulary, Arithmetic 
Fundamentals, Arithmetic Problems, English, Literature, Geography, History, 
Civics, Science, Spelling. Forms R, S, T, U, V. $5.25 per 35. 200 (240)
minutes. 

Intermediate, Grades 5-7.5. Partial: Reading, Vocabulary, Arithmetic 
Fundamentals, Arithmetic Problems, English, Spelling. Forms R, S, T, U, V.
$4.20 per 35. 155 (180) minutes. 

Advanced, Grades 7-9.5. Complete: areas same as intermediate complete. 
$5.25 per 35. 220 (240) minutes; Partial: same as intermediate partial. 
$4.20 per 35. 165 (180) minutes. Specimen sets of any one level battery, $.50. 

World Book Company. 


7. Modern School Achievement Tests. 1948. Skills Edition, Grades 2-8. 
Reading Comprehension, Reading Speed, Arithmetic Computation, Arithmetic 
Reasoning, Spelling. Forms I, II. $3.10 per 35. 120 minutes.

Bureau of Publications, Teachers College, Columbia University. 


8. National Achievement Tests, Municipal Battery. 1938-39. First level,
Grades 3-6. Complete: Reading Comprehension, Reading Speed, Spelling, 
Arithmetic Fundamentals, Arithmetic Reasoning, English, Literature, Geog- 
raphy, History and Civics, Health. Forms A, B. $3.75 per 25. 205 (225) 
minutes. Partial: first six tests of complete battery. Forms A, B. $3.25 per 
25. 138 (155) minutes. 

Second level, Grades 6-8. Complete: same areas as for first level. Forms 
A, B. $3.75 per 25. 202 (225) minutes. Partial: same as for first level. Forms
A, B. $3.25 per 25. 137-8 (155) minutes. 

Acorn Publishing Company. 


9. S.R.A. Achievement Series. New Edition, 1956. First level, Grades 2-4. 
Reading, Arithmetic, Language Arts, Language Perception. Form A. $5.50 
per 20 booklets. Approximately six hours. 

Second and third levels, Grades 4-6 and 6-9. Reading, Arithmetic, Lan- 
guage Arts, Work-Study Skills. Forms A and B. $14.00 per 20 booklets. 
Answer folders, $1.60 per 20. Approximately six hours. 

Science Research Associates. 


MEASUREMENT OF THE THREE R's


Traditionally, the Three R's has meant reading, writing, and arithmetic.
At one time these subjects alone constituted the curriculum in the grammar 
school, and the curriculum was unadorned and unhampered by any “fads 
and frills.” In a very real sense they are still the backbone or basic core of 
all instruction, not only in the elementary grades, but also at higher levels. 
They represent the means of communicating ideas and concepts without 
which such exchange is limited to word-of-mouth expression. In primitive
societies transmission by word of mouth is used to convey folklore and
some rudiments of culture from generation to generation, but no society 
has ever achieved a high degree of civilization without having first pro- 
gressed beyond oral communication. 

Every modern educational program is based on the assumption that 
pupils must know how to read, write, and “cipher.” It is hard to imagine 
how schools could function today without these rudimentary skills of the 
teachers and pupils. If teachers had to teach without them it seems 
evident that our whole system of education as we know it, and our whole 
civilization, would break down. There would be no books, no newspapers 
or magazines, no mails, no mathematics, science, or history; probably none 
of our modern systems of communication like the telegraph, telephone, or 
radio would have been invented. Societies that have not developed sys- 
tems of written communication have remained primitive. 

It is not surprising, therefore, that schools everywhere place great em- 
phasis on the development of these skills. In the kindergarten an attempt 
is made to develop reading readiness, number concepts, and skill in using 
crude writing tools such as chalk and crayon. If the child does not attend 
kindergarten, but has his first school experience in first grade he still, as a 
rule, has opportunities to develop some of these rudimentary but important
skills and concepts through children’s story books, games, and toys. He 
learns to associate words with objects and pictures, orally at first by naming 
them, and later by recognizing the printed word; he learns to count blocks, 
or kittens, or soldiers; and he learns to hold and use crayons or other writing 
tools by coloring, copying, or drawing pictures. 

The concept of the Three R’s has broadened somewhat to include more 
than reading, writing, and arithmetic.⁵ What are now considered basic
tool-subjects include, in addition to these three, spelling and language arts 
(capitalization, punctuation, grammar, and sentence structure), as well as 
oral communication. To be sure, instruction in most of these other subjects 
was a related part of the Three R’s in earlier days, but the others were not 
generally regarded as separate subjects or areas of instruction. In the sec- 
tion which follows, consideration will be given to measurement practices and 
techniques in reading, writing, arithmetic, spelling, and the language arts, 
where these have resulted in some generally accepted and widely used in- 
struments, e.g., standardized tests. 


Reading and Reading Readiness 
Reading is perhaps the most important of the tool-subjects mentioned 
above, for it comes close to being the basis of our civilization. Illiterate
persons cannot progress far in our society, nor can they be effective citizens.
It is not surprising, therefore, that reading is the first accomplishment the
school attempts to give the child. Until he can read, the child cannot be-
come educated according to our concepts of education. If he does not learn
to read well he is constantly handicapped in his progress through the
schools. Because of the importance accorded to reading, much research
and study have been done on the nature of the reading process, on methods
of teaching reading, on the causes and remedies for reading disabilities,
and, together with all of these, on the development of tests and techniques
for measuring reading ability and skill. Literature in educational research
contains a wealth of material on this subject. Much of this material is of
interest and concern to every teacher, from the kindergarten to the uni-
versity level.


5 Gertrude Hildreth, Learning the Three R's: A Modern Interpretation, Second Edition
(Minneapolis, Minn.: Educational Publishers, Inc., 1947).

Reading tests have been mainly concerned with three areas of measure- 
ment. One of these areas is reading readiness. It has become pretty 
generally accepted that readiness to learn to read is, at least in part, a 
maturational process. That is, a child is ready to learn to read when he 
has developed intellectually and in other ways to a given level of maturity. 
Authorities differ on the exact level required, but it is generally agreed that 
it is not below the mental age or maturity of the average five-and-a-half- 
year-old. Some place the minimum a half year or so higher. Whatever 
the minimum is, and it probably varies somewhat under different circum- 
stances, it seems clear that it is very difficult and usually useless to try to 
teach a child to read before he has reached this minimum maturity level. 
A good deal of work has been done to develop tests which will give a fairly 
accurate measure of a child's readiness in a short time. 

Another area of measurement in reading concerns achievement. Read- 
ing achievement tests are available for use at all levels from first grade 
upwards. They generally are directed toward the measurement of reading 
comprehension and speed of reading. Reading comprehension is usually 
thought of as the power to understand and remember what is read, while 
speed of reading refers to the amount that is read in a given time without 
measurable loss in comprehension. Both are important reading skills, and 
most reading achievement tests are designed to measure both, either di- 
rectly or indirectly. 

The third area of measurement in reading involves diagnostic testing. 
This is most useful with the slow reader or the reader who has difficulties of 
one sort or another. The diagnostic reading test is designed to identify 
such difficulties or weaknesses and to suggest ways of overcoming them. 
One approach to this — more in the nature of a tool than a test — involves 
a machine that photographs the eye movements of the reader as he reads. 
This device has made it very clear that one demonstrable and consistent 
difference between good and poor readers is in the number of pauses the eye 
makes. The good reader makes relatively few pauses and his eyes take in 
long fragments without stopping; the poor reader, on the other hand, reads
jerkily, his eyes pause much more often and take in only a word or, at best, 
a few words between pauses. While the apparatus for photographing eye 
movements is not a test in the usual sense, it is a measuring instrument 
that yields information which has been found very useful in helping poor 
readers to improve. 


Gates Testing Program in Reading 


In order to orient the student to the area of measurement in reading, we 
shall describe in detail the comprehensive testing program by Gates. The 
Gates program is unique in that it provides a continuous and related series 
of reading tests from kindergarten to tenth grade. It includes tests of every 
type found in this area, with the possible exception of study skills. The 
student who gains some familiarity with, and understanding of, the Gates 
tests will have a good orientation not only to the types of tests commonly 
used but also to a wide variety of techniques of measurement in reading. 
The first tests in the Gates Testing Program are the Reading Readiness Tests.


1. Name of tests and author. Gates Reading Readiness Tests.6 Arthur I.
Gates, Teachers College, Columbia University. 


2. Purpose. To determine which children are ready to begin reading; 
how rapid their progress is likely to be; what specific abilities required in learn- 
ing to read need development. 


3. Grade level. End of kindergarten and beginning of first grade.
4. Number of forms. One. 


5. Publisher and date of publication. Bureau of Publications, Teachers 
College, Columbia University, 1939. 


6. Cost. $1.95 per 35. 


7. Content. Picture Directions: Three line drawings — a farm scene, a 
town scene, and interior of a general merchandise store. The examiner makes 
oral comments and asks the children to carry out certain instructions by mark- 
ing the pictures. 

Word Matching: Marking two words that are the same in each of several 
groups of four words. 

Word Card Matching: A word on a flash card is shown for five seconds. The
word is then found and marked in a group of four words in the test booklet.

Rhyming: Selecting from groups of four pictured objects one whose name 
rhymes with a key word given orally; e.g., the objects — hat, dog, cup, and 
horse — are pictured and the examiner asks for the one that sounded like “pup.” 

Reading Letters and Numbers: This is given individually. The child is asked 


6 Arthur I. Gates, Gates Testing Program in Reading (New York: Teachers College,
Columbia University, 1939-1945).

7 Quotations in the present description of the Gates Testing Program in Reading by
permission of the publisher.
to name capital letters, lower case letters, and numbers from 0 to 9, which are
printed on the test blank. 

Ability to Grasp the Structure and Substance of a Story: This is not a part of the
Readiness Test as such, but is recommended as a valuable adjunct. The first half
of a good story from about the middle of a primer is read aloud. Then the child 
is asked to tell what happened next. The child’s account is recorded and judged 
on general merit. 


8. Time required. The tests are non-timed and it is not required that all 
five tests be given. If all are given, the total time is about one hour. In any 
case, it is recommended that tests 1, 2, and 5 be given in the first period and 
that tests 3 and 4 be given later. 

9. Directions for, and ease of, administering. The tests are easily and 
simply administered by any competent teacher who will follow directions. 

10. Validity. The tests were selected on the basis of several extensive 
research studies using many kinds of tests to identify those most useful for 
measuring reading readiness. The most promising were tried out with an 
entire entering-school population in a small city, and on the basis of these re- 
sults a revised test was given further trial in another group of schools. On the 
basis of these data the present test was constructed. 

11. Reliability. The reliabilities of the five parts of the test are as follows:
Picture Directions, .84; Word Matching, .78; Word Card Matching, .82; Rhym- 
ing, .84; Reading Letters and Numbers, .96. The reliability coefficient of the 
whole test is .97. 

12. Manual. The manual is complete and clear. Besides descriptions of 
the tests, directions for administering and scoring the tests and interpreting 
scores, there are helpful suggestions for remedial work in the various abilities 
measured, and for predicting reading progress on the basis of scores on the test. 
The manual contains numerous footnote references to studies, and a selected 
bibliography. 

13. Scoring. The tests are generally scored by counting the number of 
correct answers. No prepared scoring keys are provided, but directions are 
given for making simple scoring keys. The scoring is quite objective, being
essentially a process of counting the number of exercises correctly marked. 

14. Norms. Separate tables of percentile norms are provided for kindergarten
and for first-grade testing for each of the five sub-tests or parts.


15. Format. The tests are printed on good quality paper in an 8½" x 11"
booklet of eight pages. The manual is a 6" x 9" booklet of 31 pages.


The next tests in the series are the Primary Reading Tests. A description 
of these tests follows: 


1. Name of tests and author. Gates Primary Reading Tests. Arthur I.
Gates. 


2. Purpose. These tests are intended to measure level and range of ability 
in Word Recognition, Sentence Reading, and Paragraph Reading. 


3. Grade level. Grade 1 and first half of Grade 2. 
4. Number of forms. Forms 1, 2, 3.


5. Publisher and date of publication. Bureau of Publications, Teachers 
College, Columbia University, 1943. 


6. Cost. $1.10 per 35 copies of each of the three types of tests. Manual, 
$.25. 


7. Content. Type 1: Word Recognition consists of 48 exercises, each of 
which is made up of a picture followed by four words. The task is to select the 
one word in each exercise that “tells the most about the picture.” For example: 


Type 2: Sentence Reading consists of 45 exercises, each including three sen- 
tences marked respectively I, II, and III, and followed by drawings of six ob- 
jects, three of which are answers to the sentences. Thus: 


This is a boy. I
This is a girl. II
This is a box. III


The child marks the drawing of the boy with I; that of the girl, II; that of the
box, III.

Type 3: Paragraph Reading consists of 26 exercises, each containing three 
drawings followed by a sentence (in the first third of the exercises), or a short 
paragraph directing the child to do something with the drawings. For example: 


3. Draw a line under the little 
book. 
The hardest exercises contain paragraphs consisting of several sentences, usually
describing the drawings or telling a little story. Following the directions re- 
quires an understanding of the sense of the paragraph. 


8. Time required. Word Recognition, 15 minutes; Sentence Reading, 15 
minutes; Paragraph Reading, 20 minutes. 


9. Directions for, and ease of, administering. These are simple and 
are printed on the title page of the test booklet. Mainly, the task is to make 
sure that pupils understand what they are to do and do their best, and the 
teacher is encouraged to do anything that helps to achieve this purpose short
of telling anyone the answer to any exercise. Also, no deviation from the estab- 
lished time limits is permitted. 


10. Validity. The tests are said to measure different phases of reading 
ability, and to be diagnostic. All words used in them are taken from the Gates
Reading Vocabulary for Primary Grades, based on the speech of young children 
and primary reading materials. They are said to be related to interesting and 
important features of children’s lives. On the basis of such sources the tests are 
said to be suitable as a measure of the extent of mastery of basal vocabulary 
and independent reading ability, largely regardless of the reading materials 
used by the individual teacher. 


11. Reliability. No statistical data on reliability of the tests are given.
When the tests are properly used they are reported to give reasonably reliable 
results in individual diagnosis and to be highly reliable for class use. The relia- 
bility is said to depend to a great extent upon adequate explanation to the 
pupils and upon getting them to work systematically and vigorously, but with- 
out stress or excitement. 


12. Manual. The manual is actually a handbook on a remedial reading program.
The major portion of it is given to suggestions for identifying and diagnosing
reading difficulties and for overcoming or reducing them.


13. Scoring. The score in Type 1 is the number correct minus one-third the 
number incorrect. Score in Types 2 and 3 is the number correct. No prepared 
scoring keys are provided, but they can easily be prepared from a copy of the 
test correctly marked. Directions for scoring are easily followed. In most
instances the correct answers are obvious to the teacher. However, scoring is
done strictly according to directions, with reasonable allowances for motor 
limitations, but none for other failures such as using a different type of mark 
than that called for. 
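
The scoring rule just described can be illustrated with a short sketch. The Python below is only an illustration and is not part of the Gates materials: the function names and the sample counts are hypothetical, and it assumes that omitted exercises count neither as correct nor as incorrect.

```python
from fractions import Fraction


def primary_type1_score(correct: int, incorrect: int) -> Fraction:
    """Type 1 (Word Recognition): number correct minus one-third the number incorrect.

    Assumption: omitted exercises are simply not counted.
    """
    if correct < 0 or incorrect < 0:
        raise ValueError("counts must be non-negative")
    # Fraction keeps the thirds exact instead of rounding them.
    return Fraction(correct) - Fraction(incorrect, 3)


def number_correct_score(correct: int) -> int:
    """Types 2 and 3 (Sentence and Paragraph Reading): the score is the number correct."""
    if correct < 0:
        raise ValueError("count must be non-negative")
    return correct


if __name__ == "__main__":
    # Hypothetical pupil: 30 exercises right and 6 wrong on Type 1.
    print(primary_type1_score(30, 6))
    # Hypothetical pupil: 20 exercises right on Type 2.
    print(number_correct_score(20))
```

Exact fractions are used here so that a score such as 9 2/3 is not silently rounded; how fractional scores are to be rounded before the norms are consulted is a matter for the manual.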


14. Norms. Grade and age norms are provided for each of the three types 
of tests. These are based on approximately 250,000 records from schools in all 
parts of the United States. Norms in the forms of 75th and 25th percentile 
scores for each type of test and by grade levels from 1.5 to 3.4 are also provided. 


15. Format. The tests are well printed, each type being in the form of a four-page
8½" x 11" booklet. The manual is a 6" x 9" booklet of 45 pages.


The third set of tests in the Gates series is the Advanced Primary Reading
Tests. These are similar in organization and content to the Primary
Reading Tests except that they include only Type 1, Word Recognition,
and Type 3, Paragraph Reading (here called Type 2). They are designed
for the second half of Grade 2 and Grade 3.


Type 1 contains 48 exercises consisting of a picture and four words from 
which the pupil chooses one describing the picture. These differ from those in 
the Primary Tests in that they reach considerably higher levels of difficulty. 

Type 2 consists of 24 paragraphs with pictures as in Type 3 of the Primary 
Tests. Again, the exercises in the Advanced Primary are more difficult than 
those in the Primary Tests.

In all other respects — tests, manual, administration, scoring, norms, etc. — 
the two sets of tests are the same or quite similar. The Primary Tests require 
50 minutes for three sub-tests, the Advanced Primary require 40 minutes for 
the two sub-tests. Reliability coefficients are .89 for Type 1, and .87 for Type 2.
Age and grade norms, but no percentile norms, are given. These are based on 
only 5,500 records as against 250,000 for the Primary Tests. The manual is very 
similar in completeness to that for the Primary Tests, and many sections are
identical. The cost of the separate tests in the two sets is the same.


The Gates Basic Reading Tests come next in this series.


1. Name of tests and author. Gates Basic Reading Tests. Arthur I. Gates.


2. Purpose. These tests are said to measure speed of reading easy material 
for four different specified purposes, and accuracy of comprehension. They do 
not measure level of comprehension or reading vocabulary. 


3. Grade level. Second half of Grade 3 through Grade 8. 
4. Number of forms. Forms 1, 2, 3, 4. 



5. Publisher and date of publication. Bureau of Publications, Teachers
College, Columbia University, 1942.


6. Cost. $1.10 per 35 copies of any one type. Manual, $.25.


7. Content. Type A, Reading to Appreciate General Significance, is de- 
signed to measure skill in reading for an accurate general impression of a passage. 
It is concerned with the type of reading exercised by adults in reading news- 
papers, popular fiction, etc. It consists of 24 short paragraphs, each followed 
by five words, one of which expresses the general sense or idea of the passage. 

Type B, Reading to Predict the Outcome of Given Events, is designed to 
measure what Type A measures, but also the ability to analyze facts given in 
the paragraphs in order to predict what will happen. It consists of 24 short 
paragraphs or passages, each followed by four sentences, one of which tells what 
will probably follow next after the happenings that are described in the passage. 

Type C, Reading to Understand Precise Directions, measures ability to read 
with exactness and precision, to select relevant details and subordinate others, 
and to retain the precise directions to be followed. It consists of 24 pictures, 
each followed by a brief explanation of what the picture is about. The state- 
ment includes a direction to do something with the picture to show that it is 
properly interpreted and that the instructions are understood. For example, 
one picture shows three different weather signal flags. The statement explains
what each means and directs the pupil to draw a line below the flag that 
means cold weather. 

Type D, Reading to Note Details, consists of 18 paragraphs, each followed 
by three questions with four (multiple-choice) answers for each question. The 
task is to answer the questions according to details given in the paragraph. The 
test is intended to measure ability to comprehend several points in a paragraph 
at once. 


8. Time required. Type A, 8 minutes for Grades 3 and 4, 6 minutes for 
Grade 5 and above. Type B, 10 minutes for Grades 3 and 4, 8 minutes for 
Grade 5 and above; Type C and Type D, same as Type B. 


9. Directions for, and ease of, administering. The tests are easy to 
administer and directions are adequate. 


10. Validity. The validity of the tests rests upon the assumptions (1) that 
the four skills measured are actually important and distinct types of reading 
abilities, and (2) that the author has been successful in devising tests that 
measure them. They represent examples of tests purporting to have what has 
been referred to earlier in this book as psychological validity. The validity of all
reading tests rests upon more or less similar grounds. So it may be said that 
there are probably as good reasons for assuming the validity of the Basic 
Reading Test as there are for assuming the validity of other similar instruments. 
The author could have determined the correlation between scores on his test 
and those made by the same pupils on another reading test. However, if the 
validity of the latter were no more firmly established than his, the correlation 
would show only how well the two tests were in agreement, whatever they might 
be measuring. 

11. Reliability. Extensive series of self-correlations (presumably test-retest
coefficients) are given by grade and for each type. With only one exception
(.76), all are in the .80's and low .90's. These are quite satisfactory, especially
considering how short the tests are, and the fact that approximately half
of the coefficients are based on groups in the third and fourth grades.

The intercorrelations between the four types are also high, ranging from a
low of .66 to a high of .92. Most of them are in the .80's, suggesting that the 
four types of tests are measuring very much the same thing. 


12. Manual. As in the case of the Primary and Advanced Primary Tests,
the manual is really a booklet replete with suggestions for use of the test results 
in a reading improvement program, including diagnosis of difficulties and weak- 
nesses, and practical suggestions for remedial activities. 


13. Scoring. Actual scoring of the four types of tests is quite simple. All 
scores are the number right. The scoring of Type C in which the pupil responds 
by circling, marking with an X, drawing lines, etc., probably presents some 
problems since responses are marked wrong unless executed exactly according 
to directions, and the intent of the pupil may in some cases be difficult to 
determine. 

In addition to the number correct, the scorer also determines the number 
attempted. From these an accuracy score, namely, the percentage which the 
number correct is of the number attempted, is obtained. 
Since the tests are closely timed, the number attempted is a measure of speed
of reading. It must, however, be considered always in relation to the accuracy 
score, since a pupil may be assumed to be reading too fast if his accuracy score 
falls below 90%. The passages, uniformly quite easy even for Grades 3 and
4, are not designed to measure level of comprehension; therefore, the level of 
comprehension should be high unless the pupil attempts to read too fast. 
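
The accuracy score and the 90 per cent rule of thumb can be sketched in a few lines. This is an illustrative sketch only: the function names and the sample figures are hypothetical, and the 90 per cent threshold is taken from the description above rather than from any published scoring program.

```python
def accuracy_score(correct: int, attempted: int) -> float:
    """Percentage which the number correct is of the number attempted."""
    if attempted <= 0:
        raise ValueError("at least one exercise must be attempted")
    if not 0 <= correct <= attempted:
        raise ValueError("correct must lie between 0 and attempted")
    return 100.0 * correct / attempted


def reading_too_fast(correct: int, attempted: int, threshold: float = 90.0) -> bool:
    """True when the accuracy score falls below the threshold (90% in the text)."""
    return accuracy_score(correct, attempted) < threshold


if __name__ == "__main__":
    # Hypothetical pupil: 20 paragraphs attempted in the time limit, 17 correct.
    print(accuracy_score(17, 20))      # accuracy as a percentage
    print(reading_too_fast(17, 20))    # whether the pupil is flagged
```

The number attempted is kept as a separate input because it serves as the speed measure, which is to be read alongside the accuracy percentage and never in place of it.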


14. Norms. Reading grade and reading age norms based on about 350,000 
records are provided. These are based on scores from schools in all parts of the
United States. Separate norms are provided for each type of test for the two 
times, 8 and 10, and 6 and 8 minutes, respectively, as described under “Time 
Required,” above. These are for the number correct. In addition, there are
norms for accuracy scores by grades from 3 to 9, inclusive. The accuracy norms 
are given for five groups: Very High, High, Medium, Low, and Very Low per- 
centages. Finally, there are norms for the number of paragraphs attempted, by 
grade and timings, for average, slow, and very slow reading. 


15. Format. Test booklets are attractively printed on 8½" x 11" folders.
The manual is a 37-page booklet, 6" x 9".


The last test in the Gates series is a measure of over-all or general reading 
ability and is more similar to other reading tests than the Gates tests already 
mentioned. It is described below. 


1. Name of test and author. Gates Reading Survey. Arthur I. Gates.


2. Purpose. To complement the Gates Basic Reading Tests by providing
a measure of vocabulary, comprehension, speed, and accuracy, the first two of 
which are not measured by the Basic tests. 


3. Grade level. Grade 3 through Grade 10. 
4. Number of forms. Forms I and II. 


5. Publisher and date of publication. Bureau of Publications, Teachers 
College, Columbia University, 1942. 


6. Cost. $2.40 per 35. 


7. Content. The Vocabulary section consists of 85 words ranging from the 
first to the twentieth thousand in the Thorndike word list. Each word is fol- 
lowed by five other words, one of which means the same or nearly the same as 
the key word. 

Power or Level of Comprehension is measured with 35 paragraphs from each 
of which 2, 3, or 4 words are omitted as in a completion type of exercise. For 
each omission five words are provided from which to choose one which makes 
the best sense as a completion of the blank. The paragraphs range in difficulty 
or complexity from those readily comprehended by a third-grader to some that 
are difficult for the average college freshman. 

Speed of Reading. This consists of 64 short paragraphs, each ending with a 
question followed by four words, one of which best answers the question. The
paragraphs are easy enough to be understood by the average second-grade 
pupil. The purpose is to see how rapidly a pupil can read such easy material
with comprehension. 

Accuracy of Reading is not a separate part of the test, but again, as in the Basic 
tests, is a function of the number of paragraphs read and answered correctly in 
the speed test. It is expressed as the percentage which the number answered 
correctly is of the number attempted or marked. 


8. Time required. The vocabulary and comprehension sections are un- 
timed and the pupil is allowed as much time as he can use. The speed test is 
rigidly timed — 10 minutes for Grades 3, 4, and 5, and 7 minutes for Grades 6 
to 10. The total time required to give the three tests usually varies between 60 
and 90 minutes, depending upon age and maturity of pupils. 


9. Directions for, and ease of, administering. The tests are easily ad- 
ministered, and the directions are clear and easily followed. Each of the three 
parts is preceded by a practice exercise or problem. In Grade 3 it is recommended
that each part be given at a single sitting. In Grade 4 and above, the vocabulary
test should be given at one sitting and the remainder at another sitting. 


10. Validity. What was said above with reference to validity of the Basic 
Reading Test applies equally to the Reading Survey Test. The extent to 
which these tests measure what they purport to measure is determined by the 
author's concepts of what reading ability consists of, and his success in de- 
vising tests of these components. In fairness to these tests, it should be said 
that they are similar to other tests in the same field in many respects, and 
therefore are not out of line with current ideas in measuring reading ability.
They do vary in certain respects from other existing tests in this field, for example,
in the comparative simplicity of approach. Whether this contributes to
their validity is a matter of opinion. Certainly it is an advantage to the user. 


11. Reliability. Reliability coefficients are given for the four scores obtainable
by use of these tests as determined by correlation between the two
forms given at different times. For vocabulary, the values range from .88 to 
.92. For comprehension, they range between .85 and .88. For speed, the 
coefficients range from .87 to .90. The reliability of the accuracy scores is somewhere
between .82 and .90, with a probable value of about .85. These are not
extremely high, but the reliability of the entire test is probably in the neighbor- 
hood of .90. Intercorrelations between the four scores average .62. This is 
lower than comparable coefficients in the Basic Reading Test. 


12. Manual. The manual is an 18-page booklet giving the necessary data 
for use of the test and the interpretation of results. Otherwise, it is mainly concerned
with the development of the concept of this test as a complement to the
Basic Reading Tests, and of ways in which the two tests can be used as a pair 
for instruction and remedial work in reading. 


13. Scoring. Vocabulary and comprehension are scored by the formula,
Rights - (Wrongs / 4), to correct for chance successes. The score on the speed
test is the number correct. The accuracy score is the number correct divided by
the sum of the number correct and the number wrong. Tables are provided
which save the user the work of dividing.


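
The reasoning behind a correction for chance successes can be set out in a short worked derivation. It assumes the customary argument, which the manual itself may word differently: a pupil who guesses blindly among k choices is right, on the average, once in k times. The symbols R (rights), W (wrongs), k (choices per item), g (items guessed), and S (corrected score) are introduced here only for illustration.

```latex
\begin{align*}
S &= R - \frac{W}{k-1} \\
E[R] &= \frac{g}{k}, \qquad E[W] = \frac{g\,(k-1)}{k} \\
E[S] &= \frac{g}{k} - \frac{1}{k-1}\cdot\frac{g\,(k-1)}{k} = 0
\end{align*}
```

On this reasoning blind guessing adds nothing, on the average, to the corrected score. With five choices per item (k = 5), as in the vocabulary and comprehension sections, the deduction is one-fourth of the wrongs; with four choices (k = 4), as in Type 1 of the Primary tests, it is one-third, which agrees with the scoring rule given for that test.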
14. Norms. Age and grade norms are given for each of the three tests,
separately for the different time limits on the speed test. Grade norms by Very 
High, High, Medium, Low, and Very Low are provided for accuracy scores, also. 
These norms are based on over 20,000 pupils in Grades 2 to 12. 


15. Format. In makeup, materials used, printing, and other features of 
physical appearance, these tests are comparable in all respects to the others in 
the series, and are consistent with the high standards set by the best tests in 
the field. 


Learning Exercises


3. Does the Gates Testing Program described here provide an adequate set of 
measurements of all the important outcomes of reading? If not, which are inade- 
quately provided for or omitted? 

4. Present-day reading instruction seems to stress silent reading more than oral 
reading, which formerly was emphasized a great deal. What are some of the reasons 
for this change in emphasis? Is the change desirable, in your opinion? 

5. If you were interested in making an objective test of oral reading ability, how 
would you proceed? 


Other Reading and Reading Readiness Tests 
Reading Tests


1. California Reading Tests. Sub-tests of California Achievement Tests, a re- 
vision of Progressive Achievement Tests. 1950. Primary, Grades 1-4.5; Ele- 
mentary, Grades 4-6; Intermediate, Grades 7-9; Advanced, Grades 9-14. 
Reading Vocabulary, and Reading Comprehension. 

Forms AA, BB, CC, and DD (Advanced: 3 forms only). $.07 per copy. 
Primary, 30 minutes; Elementary, 35 minutes; Intermediate, 50 minutes; Ad- 
vanced, 50 minutes. 

California Test Bureau. 


2. Durrell Analysis of Reading Difficulty. 1956. Grades 1-6. Silent and 
Oral Reading, Listening Comprehension, Word Analysis, Phonetics, Faulty 
Pronunciation, Writing, and Spelling. 

One form: individual. $3.25 per examiner’s kit. Additional forms and 
record blanks, $3.50 per 35. 30-45 minutes.

World Book Company. 


3. Durrell-Sullivan Reading Capacity and Achievement Tests. 1937. Primary, 
Grades 2.5-4.5; Intermediate, Grades 3-6. 

Capacity: Pictorial — Word Meaning and Paragraph Meaning — measured 
by Oral Language. Achievement: Word Meaning, Paragraph Meaning, Spell- 
ing and Written Recall. Primary, $4.30 per 35; Intermediate, Capacity, $2.80 
per 35; Achievement, $3.30 per 35. 

World Book Company. 


4. Elementary Reading Test, Intermediate Reading Test, Advanced Reading 
Test. Sub-tests of Metropolitan Achievement Battery. 1946-49. Grades 3-4, 
5-7.5, and 7-9.5. Vocabulary and Paragraph Reading. 
Forms R, S, and T. Elementary, $2.85 per 35; Intermediate and Advanced,
$2.45 per 35. 45 minutes.
World Book Company. 


5. Iowa Silent Reading Tests. 1956. Elementary, Grades 4-8; Advanced, 
high school and college. Rate of Reading, Comprehension, Vocabulary, and 
Skills in Locating Information. 

Forms Am, Bm, Cm, and Dm. Elementary, $3.30 per 35; Advanced, $4.45 
per 35. Elementary, 49 minutes; Advanced, 45 minutes. 

World Book Company. 


6. Kelley-Greene Reading Comprehension Test. Grades 9-13. Reading Com- 
prehension, Directed Reading, and Retention of Details. 

Forms Am and Bm. $5.00 per 35. 63 minutes. 

World Book Company. 


7. Lee-Clark Reading Test, First Reader. 1943. Grades 1, 2. Vocabulary, 
Following Directions, Sentence Completion, Inference. 

Forms A and B. $.08 per copy. 25 minutes. 

California Test Bureau. 


8. Nelson-Denny Reading Test. 1930. Grades 9-16. Vocabulary and
Reading Comprehension.

Forms A and B. $4.50 per 35 tests and self-marking answer booklets. 30
minutes.

Houghton Mifflin Company. 

9. Nelson Silent Reading Test. 1931-39. Grades 3-9. Vocabulary, Reading
Comprehension, Ability to Note Details, Ability to Predict Outcomes. 

Forms A, B, and C. $4.50 per 35 tests and self-marking answer booklets. 
30 minutes. 

Houghton Mifflin Company. 

10. Primary Reading Test. 1943-53. Grades 2-3. Word Recognition, 
Synonyms, Antonyms; Story, Paragraph, and Sentence Meaning. 

Forms A and B. $2.50 per 25. Manual, $.25. 31 minutes. 

Acorn Publishing Company.

11. Reading Tests. Sub-tests of Stanford Achievement Battery. 1952. Intermediate,
Grades 5, 6; Advanced, Grades 7-9. Vocabulary and Paragraph
Meaning.

Forms Jm and Km. $3.15 per 35. 37 minutes.

World Book Company.


12. Stroud, Hieronymus, and McKee Primary Reading Profiles. 1953, 
1955, 1957. Grades 1 and 2. Aptitude for Reading, Auditory Association, 
Word Recognition, Word Attack, Reading Comprehension. 

Levels 1 and 2. $3.60 per 35 tests, plus a Manual for Administration, a
Class Record Sheet, and a set of Scoring Keys.

Houghton Mifflin Company. 

13. Test of Silent Reading Comprehension. Sub-test of Iowa Every-Pupil 
Test of Basic Skills. 1942. Elementary, Grades 3-5; Advanced, Grades 5-9. 
Paragraph Meaning and Vocabulary. 
Forms L, M, N, and O. $3.00 per 35; $3.15 per 35, plus Examiner's Manuals,
keys, and class summary report of scores. Elementary, 46 minutes;
Advanced, 85 minutes. 

Houghton Mifflin Company. 


14. Traxler High School Reading Test. 1938. Grades 10, 11, and 12. Part I, 
Rate of Reading and Comprehension at that rate; Part II, Finding Main Ideas 
in a Paragraph. 

Forms A and B. 45 minutes. $2.75 per 25. Manual, $.15.

Public School Publishing Company. 


15. Traxler Silent Reading Test. 1934-42. Grades 7 to 10. Rate of Read- 
ing; Story Comprehension, Word Meaning; Power of Comprehension. 

Forms 1, 2, 3, and 4. 46 minutes. $2.75 per 25. Manual, $.18. 

Public School Publishing Company. 


Readiness Tests


1. American School Reading Readiness Test. 1955. Kindergarten, Grade 1. 
Vocabulary, Discrimination of Letter Forms and Letter Combinations, Recog- 
nition of Words, Discrimination of Geometric Forms, Following Directions, 
Memory of Geometric Forms.

Form D. $2.50 per 25. 

Public School Publishing Company. 


2. Harrison-Stroud Reading Readiness Profiles. 1950. Kindergarten, Grade 
1. Visual Discrimination, Using the Context, Auditory Discrimination, Using 
Context and Auditory Clues, Using Symbols. 

One form. $3.75 per 35. 76 minutes. 

Houghton Mifflin Company. 


3. Lee-Clark Reading Readiness Test. 1951. Kindergarten, Grade 1. Visual 
Discrimination in Letters, Conceptual Maturity, Vocabulary, Following In- 
structions, Word Forms. 

One form. $.09 per copy in packages of 35. 20 minutes. 

California Test Bureau. 


4. Metropolitan Readiness Tests. 1949-50. Kindergarten, Grade 1.
Forms R and S. $4.00 per 35. 60 minutes. 
World Book Company. 


5. Monroe Reading Aptitude Tests. 1935. Kindergarten, Grade 1. Visual 
and Auditory Discrimination, Motor Control, Oral Speed and Articulation, 
Language.

One form. $3.00 per 35 tests, with Manual of Directions. Teachers’ Ma- 
terial, $1.20 per set. Group Tests, 40 minutes; Individual Parts, 15 minutes. 

Houghton Mifflin Company. 


6. Murphy-Durrell Diagnostic Reading Readiness Test. 1949. Grade 1 en- 
trants. Auditory Discrimination, Visual Discrimination, Learning Rate. 

One form. $2.80 per 35. Flash Cards, $2.00 per set. 80 minutes plus 
brief individual testing in Part 3. 

World Book Company. 
Handwriting


Writing and speech are sometimes referred to as the “expressive language
arts,” as contrasted with reading and listening, which are called the “receptive
language arts.” Of these, reading and writing are the ones considered
here. Little has been accomplished in the measurement of speech or listen- 
ing abilities, while, of course, measurement in reading is extensive — as we 
have learned in the preceding section. 

There has been considerable change in emphasis in the teaching of writing 
in recent years. Attention and effort in teaching have shifted from me- 
chanics to function. That is, in the teaching of handwriting a generation 
ago, considerable attention was devoted to the pupil's development of a 
beautiful script. Much time was spent on practice from copy and on 
imitation of symmetrical handwriting. More recently, interest has shifted 
to development of speed with legibility, and emphasis has been placed on 
handwriting as a means of communication and self-expression. Writing is 
regarded as a developmental process which presupposes the ability to think 
clearly. The objective is the ability to express ideas clearly and legibly; 
little attention is given now to the artistic qualities of handwriting. This 
is not to say that inartistic writing is encouraged or condoned; indeed, 
those who insist upon expressing their individuality through their hand- 
writing — to the point where their writing is practically illegible — should 
be discouraged from this practice. 

In measuring handwriting we find a number of attempts to produce what 
is sometimes referred to as a product scale. One of the earliest of these was 
the Thorndike Scale for Handwriting of Children.⁸ 

The scale reproduces samples of writing varying in quality from Quality 4 (the 
worst writing of fourth-grade children) to Quality 18 (nearly the best writing 
of eighth-grade children). Each level of quality differs from the next highest 
or lowest by one-tenth of the difference between the best and the worst 
of the formal writings of 1,000 children in Grades 5 to 8, as ranked by com- 
petent judges. A few samples follow: 


Quality 6 


8 Edward L. Thorndike, Handwriting (New York: Teachers College, Columbia 
University, 1910). Samples reproduced by permission of the publisher. 
Quality 11 


Quality 16 


[Handwriting samples appear here in the original.] 


Standards are given in terms of: number of letters (of familiar material) 
written per minute without substantial loss of quality, for not more than 
three minutes; quality of writing when the pupil is doing his usual kind of 
writing; and quality of writing when instructions are to write as well as 
possible. The pupil's samples of handwriting are compared with the scale 
to find that level of quality in the scale which his writing most closely re- 
sembles. 

Another scale similar to that of Thorndike is the Ayres Scale,⁹ often referred 
to as the Gettysburg Edition. This was one of the most widely used 
scales ever devised, over 600,000 copies having been printed between 1917 
and 1935. It derives its name from the fact that the opening lines of Lin- 
coln's “Gettysburg Address” are used as the subject matter. The teacher 
writes on the board the first three sentences of this address and instructs 
his pupils to read and copy until familiar with it. They then copy it, writ- 
ing with ink on lined paper for exactly two minutes. The scale includes 
eight samples of levels of quality, graded from 20 to 90. The pupil's writing 
is compared with the samples for quality, and the total number of letters 
written in the two minutes is counted. 

Norms are given for both speed and quality for Grades 2 to 8. There 
are also distributions of quality and rate scores for each grade from 5 to 8, 
inclusive. The norms show a relatively constant and substantial relation- 
ship between speed and quality, and steady progression in both from grade 
to grade. 

Other scales for measuring handwriting are those by Freeman¹⁰ and 


9 Leonard P. Ayres, Measuring Scale for Handwriting: Gettysburg Edition (New York: 
Russell Sage Foundation, 1917). 

10 Frank N. Freeman, Handwriting Measuring Scale (Columbus, Ohio: Zaner-Bloser 
Company, 1930). 
Hildreth.¹¹ The Freeman scale may be used for diagnostic purposes to 
identify faults in handwriting such as too much slant or too straight, too 
heavy or too light, too angular, too irregular, and too-wide spacing. The 
Hildreth scale measures quality of manuscript writing, i.e., printing, on a 
scale of 10 to 70. The quality of 50 is shown below. 


Come To my garden 
In Spring Time and hear 


Lateral dominance or handedness may affect a child's ability to learn to 
write. It is a rather common observation that left-handed children have 
more difficulty in learning to write and in actual writing than right-handed 
children. Sometimes the difficulty results from the fact that the teacher is 
right-handed, and consequently, she and the left-handed pupils have a 
different orientation and approach to the task. Another source of difficulty 
grows out of attempts to change the left-handed to right-handed. The 
effects of such pressures, psychologically speaking, are not completely 
understood. Emotional difficulties often seem to be associated with them, 
and handwriting, a finely organized and complex sensory-motor skill, often 
seems affected adversely. 

Teaching a left-handed child to write with his right hand should be under- 
taken only after a careful study of the individual concerned reveals that 
such a course would be desirable and wise from a psychological point of 
view. In any case, tests of lateral dominance or handedness should be given 
to determine the status of the child in this trait before any decisions are 
made about changing him. 


Learning Exercise 


6. See if you can find in educational literature any accounts of tests of lateral 
dominance. What is their nature; that is, what kinds of tasks or questions do they 
consist of? Is a right-handed person also likely to be right-eyed, right-footed? What 
significance has this for learning sensory-motor skills like handwriting? 

11 Gertrude Hildreth, Metropolitan Primary Manuscript Handwriting Scale 
(Yonkers-on-Hudson, New York: World Book Company, 1933). Sample reproduced by 
permission of the publisher. 
Arithmetic 

Except for reading, probably no subject in the elementary curriculum 
has been studied and investigated more than the teaching of arithmetic. 
Numerous books and hundreds of articles have been written about it. Per- 
haps a few almost axiomatic observations about arithmetic will be useful 
before considering problems of measurement in this subject. In the first 
place, it is quite generally agreed that arithmetic is a comparatively difficult 
subject. It calls for thinking of a rigorously exact nature, understanding 
rather than memory, and ability to apply principles in different though 
analogous situations. These objectives are not easily attained, and most 
pupils do not seem to come by them naturally. 

In the second place, arithmetic is not a popular subject, partly because it 
is difficult. Pupils do not generally elect courses in mathematics, once they 
pass the required ones, unless they have a special interest in it or unless their 
educational or vocational objectives require it. 

The above statements might be interpreted to mean that arithmetic is 
inherently difficult and distasteful to many. This may be a safe assump- 
tion, yet it should be remembered that through improved teaching it may 
be possible to make arithmetic more liked and less difficult. Many students 
of the problem maintain that arithmetic has been poorly taught. It 
is said that the emphasis has been on memorizing and mechanical learn- 
ing rather than on functional use and understanding. The problems and 
activities of arithmetic have had little relationship to the lives and activi- 
ties of people. It is believed that a reorientation and reorganization of the 
teaching of arithmetic to emphasize meaning, understanding, and applications 
would do much to make it more functional, less difficult, and consequently 
less distasteful. Today's textbooks and courses of study show evidences 
of thought and effort in this direction. 

There are numerous published, more or less standardized, tests in arithmetic. 
A large number of these are older tests, published twenty-five or 
even more years ago, but still available. In the Fourth Mental Measurements 
Yearbook more than twenty tests in arithmetic, not including the 
older tests just mentioned, are reviewed. Even in most of the recent tests, 
however, the pattern of organization and types of exercises are not sub- 
stantially different from those found in the older ones. There is some indica- 
tion in a very few cases of an attempt to break away from this pattern, but 
such departures are not typical. This may not be the fault of those responsible 
for producing the tests; it is quite possible that teachers themselves are 
not yet ready to use tests that depart significantly from established patterns 
because their teaching is still quite traditional. Also, a test which is too 
different might reveal inadequacies in the pupils’ learning about which the 
teacher may be somewhat sensitive. 
Coordinated Scales of Attainment: Arithmetic 


The test of arithmetic to be reviewed here is not radically different from 
conventional types. It is, however, a basically good example of the kind of 
arithmetic tests that are found in the catalogs of practically every test 
publisher and that are being used by the great majority of teachers in our 
schools today. 


1. Name of test and author. Coordinated Scales of Attainment: Arithmetic.¹² 
L. J. Brueckner, University of Minnesota. 

2. Purpose. This is part of a battery covering all the common branches, 
but is available as a separate test. Designed to measure important outcomes of 
instruction in two areas: computation and reasoning. 


3. Grade level. Grades 4-8, inclusive. 
4. Number of forms. Forms A and B. 
5. Publisher and date of publication. Educational Test Bureau, 1946. 


6. Cost. $1.50 per 25 test booklets. Answer sheets required; $.75 per 25. 
Scoring key, $.10. 

7. Content. Arithmetic Computation. The answer sheet has 45 problems 
involving the fundamental operations with whole numbers, decimals, fractions, 
percentage, and mensuration problems. Samples: 

Multiply AT 
x .06 


Write as a decimal 3.12% 


Subtract 15 min. 17.5 sec. 
— 2 min. 22.8 sec. 


Divide 13 ÷ 20 = 


The problems are worked on the answer sheet. For each problem there is an 
answer line followed by five spaces for marking as in the usual printed answer 
sheet, thus: 

[Illustration of an answer line with five marking spaces appears here in the original.] 


On the test blank there is a matching page of answers for the 45 problems. The 
pupil works all of them or as many as he can and then compares his answers 
with those on the matching page. There are four answers given for each prob- 
lem. If his answer corresponds to one of those given he indicates this on the 
answer sheet; if not, he marks the D space. There are five problems of the 45 
which have no correct answer given and that should be marked D. The manual 


12 Quotations in test description by permission of the publisher. 
states that these are included “to discourage attempts at illegitimate ways of 
getting the answer.” They are not counted in the score. 
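
The scoring rule just described can be sketched in a few lines of Python. This is only an illustration, not the publisher's scoring procedure: the function name, the answer labels ("1" to "4" and "D"), and the four-item key are hypothetical, and the sketch assumes simply that any item keyed D (no correct answer listed) is left out of the Rights count.

```python
def rights_score(key, responses):
    """Count correct responses, skipping items keyed 'D' (no correct answer listed)."""
    score = 0
    for item, keyed in key.items():
        if keyed == "D":
            continue  # decoy item: included in the test but not counted in the score
        if responses.get(item) == keyed:
            score += 1
    return score


# Hypothetical four-item key and one pupil's marks (item 2 is a decoy).
key = {1: "2", 2: "D", 3: "4", 4: "1"}
responses = {1: "2", 2: "D", 3: "3", 4: "1"}
print(rights_score(key, responses))  # items 1 and 4 are right; item 2 is ignored
```

Under this reading, a pupil gains nothing by marking a decoy item correctly, which is consistent with the statement that such items are not counted.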

Problem Reasoning. In this part there are 40 problems which are intended to 
measure ability to select from four suggested solutions the correct solution to 
each problem. Samples: 


A. Bill had 14 marbles after he gave 2 to Paul. 
How many marbles did Bill have at first? 


1. Subtract 2 from 14. 
2. Add 14 and 2. 

3. Multiply 14 by 2. 
4. Divide 14 by 2. 


B. Jean shared her 8 cookies equally with 3 friends. 
How many did she give to each friend? 


1. Add 8 and 3. 

2. Subtract 3 from 8. 
3. Multiply 8 by 4. 
4. Divide 8 by 4. 


The exercises involve common problems such as interest, taxes, profit and 
loss, commissions, installment-buying charges, ratio and proportion. 

The problems are typical of the kinds of problems that have been used in 
textbooks and tests in arithmetic for many years. The type of response called 
for is unusual in that the pupil does not actually have to work the problem to 
the point of getting an answer. 


8. Time required. The tests are not timed and are said to be power tests. 
Work is stopped when 90 per cent of the pupils have finished, but the slower pupils 
may be brought back for a special session at a later time to complete their work. 
The estimated time for the Computation is 45 minutes, and for the Reasoning, 
30 minutes. 


9. Directions for, and ease of, administering. Directions are complete 
and quite clear, but administration of the Computation is complicated because 
of the use of the matching answers feature described above. This seems likely 
to confuse some pupils, especially at lower grade levels where separate answer 
sheets are not widely used or recommended. Since the tests are not timed and 
directions are adequate, the test is easy to administer. 


10. Validity. The validity of the test rests upon two bases. The first is the 
result of analysis of some forty-five state and city courses of study. This revealed 
a common core of material being taught at each grade level and the tests were 
based on this common core. Experimental forms were constructed and ad- 
ministered to a representative sample of pupils in the grade for which the test was 
intended, and the grades immediately below and above this one. The papers 
in the middle twenty per cent of the scores of each of the three grade groups 
were selected for analysis. Each item was evaluated on the criterion of a higher 
percentage of passes by the grade for which it was intended than by the grade 
below. In the final forms, only items meeting this discrimination criterion and 
falling within the middle fifty per cent of the difficulty range were retained. 


Second, in the case of the Reasoning test, further evidence of validity is inferred 
from correlations between scores on it and (1) a standardized computation 
test, and (2) a control test equivalent in every way to the Reasoning test 
except that pupils actually worked out solutions to the problems instead of 
merely identifying the correct method. The correlation with the standardized 
computation test was .46; with the control test, .46; between the control test 
and the computation test it was .87. These results are interpreted to indicate 
that the Reasoning test measures a somewhat different function than the 
conventional type of test in which the pupil works the problem and gets an 
answer. In other words, since the correlation between the control test and the 
computation test is substantially higher than that between the Reasoning test 
and these two, the control test and the computation test are judged to be 
measuring much the same thing; yet neither one measures to as great an ex- 
tent what the Reasoning test measures. It is stated in the manual that “these 
results provide strong evidence that the traditional arithmetic problems test 
measures to a great extent the same abilities as does the computation test and 
that a score on such a problems test provides an inaccurate measure of the 
reasoning abilities involved in arithmetic problem solving. The new type of 
problem reasoning measurement used in the experimental tests was, therefore, 
adopted for use in preliminary and final forms of the present tests because it 
separates more distinctly the measurement of problem reasoning abilities from 
measurement of computational abilities.”¹³ 

As indicated, the preceding statement is quoted from the Master Manual. 
The critical user of the tests may be inclined to question the interpretation 
suggested, since a different one is possible. 
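
The item-selection rule described under the first basis of validity can be expressed as a short sketch. It is hedged in two ways: the function and variable names are hypothetical, and the phrase "middle fifty per cent of the difficulty range" is read here, as an assumption, to mean items passed by 25 to 75 per cent of the intended grade. The manual may have meant something else, such as the middle half of the items when ranked by difficulty.

```python
def retain_item(pct_intended, pct_below, low=25.0, high=75.0):
    """Return True if an item meets both retention criteria as assumed here."""
    # Discrimination: more passes in the intended grade than in the grade below.
    discriminates = pct_intended > pct_below
    # Difficulty: assumed reading of "middle fifty per cent of the difficulty range".
    in_middle_range = low <= pct_intended <= high
    return discriminates and in_middle_range


# Hypothetical tryout data: item -> (% passing in intended grade, % passing in grade below)
tryout = {
    "item_1": (60.0, 40.0),  # discriminates and is of middle difficulty
    "item_2": (60.0, 65.0),  # passed more often by the lower grade
    "item_3": (90.0, 70.0),  # discriminates but is too easy under this reading
}
kept = [name for name, (p, q) in tryout.items() if retain_item(p, q)]
print(kept)
```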


11. Reliability. The corrected correlations by odd-even halves of the arithmetic 
tests are: 


                          GRADES 
                 4, 5, 6         7, 8, 9 
Computation      .961 ± .002     .955 ± .002 
Reasoning        .913 ± .005     .844 ± .008 


The probable error of measurement is reported to be “less than two score 
units.”¹⁴ 

These data indicate a high reliability, with the exception that the Reasoning 
test in Grades 7, 8 and 9 is somewhat below desirable standards for a test of this 
sort. 
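
As a rough illustration of how a corrected odd-even coefficient is obtained, the sketch below correlates pupils' scores on the odd-numbered items with their scores on the even-numbered items and then applies the Spearman-Brown formula, r_full = 2r / (1 + r). That the manual's "corrected" figures were obtained with this particular formula is an assumption; the function names and the small data set are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation between two halves."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores: one row per pupil, one 0/1 entry per item, in test order."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd, even))


# Hypothetical right/wrong records for five pupils on a four-item test.
rows = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 0, 1]]
reliability = split_half_reliability(rows)
print(round(reliability, 3))
```

The correction is needed because each half is only half as long as the full test, and a shorter test is ordinarily less reliable than a longer one of the same kind.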


12. Manual. There are three booklets accompanying these tests. One is a 
set of directions for administering and scoring. This is clear and complete. 

The second booklet is the Master Manual for the entire battery, of which the 
arithmetic tests are a sub-test. This also contains directions for administering 
and scoring, but in addition it gives directions for interpreting the scores, dis- 
cusses norms and the preparation of a cumulative record form for individual 
pupils, presents data on the preparation of the tests, their validity and reli- 
ability, and introduces the analysis of errors as a feature of the tests. 


13 Master Manual for Coordinated Scales of Attainment, p. 23. 
14 Op. cit., p. 19. 
The third booklet, the Guide to Remedial Work, takes up the analysis of 
errors in detail. An analysis of the content of each test is given and specific test 
items are identified with respect to the particular content they test. If the errors 
and omissions of the pupil are checked against this analysis his particular 
weaknesses will be revealed. If the errors and omissions for a class or group 
are tabulated on the Class Analysis Chart, a picture of the common weaknesses 
is obtained. This is recommended as a basis for remedial instruction. 

Altogether, these manuals provide a complete and useful guide to the use of 
the battery and, to a lesser degree, for any sub-test. 


13. Scoring. The tests may be scored manually or by machine. A column 
or strip key may be used where a small number of pupils have been tested; a 
cut-out stencil to fit over the answer sheet is provided for hand or machine scor- 
ing. The scoring throughout is on the basis of Rights only. 

The Rights score on each test is converted to a Scaled Score, the basis of 
which is not explained in any of the booklets. It is implied that this is a scale of 
approximately equal units along the whole range of raw scores. It is called 
“the score.” 


14. Norms. More than 50,000 pupils, being “a carefully selected sampling 
of pupil attainment in all sections of the country,” constituted the normative 
population for the entire battery. In Grade Four, 4,691 were tested; in Grade 
Five, 7,754; in each of Grades Six through Eight, over 9,000. Schools were care- 
fully chosen on the basis of such factors as size and type of community, socio- 
economic levels, and geographical location, to constitute a representative sample 
of the total population. Schools in forty states were included. 

Norms are of two types, namely, grade equivalents and percentile rank. The 
scaled scores mentioned above are converted into either type of norm by means 
of tables of equivalents. The grade equivalents appear on the profile chart 
which, very conveniently, is on the back of the pupil’s answer sheet. Percentile 
ranks are determined by use of a Percentile Rank Indicator which fits over the 
profile chart. 


15. Format. The tests and manuals are well printed on good stock in 8½" 
× 11" booklets. Answer sheets differ somewhat from established patterns and 
do not appear to be up to the standard of the best. The arrangement by which 
the computation problems appear on the answer sheet, which, after working, 
must be matched with a set of answers in the test booklet, is somewhat unusual. 
The manual states that this arrangement has caused no difficulties in administer- 
ing the test, however. 

Scoring keys are printed on heavy stock. They are not very accurately cut for 
placement on the answer sheet, and, in general, are somewhat less professional in 
appearance and workmanship than comparable ones of the same type. 

Test accessories, in addition to those already mentioned, include a class record 
sheet, a normal progress chart (a cumulative attainment record for the indi- 
vidual pupil from Grades Four to Eight), and a score tabulation sheet for mak- 
ing a frequency distribution of scores. 
Learning Exercises 


7. There seem to be more standardized tests in arithmetic than in any other 
subject in the elementary curriculum, except possibly reading. How do you ac- 
count for this? 

8. We have not discussed in this section any so-called diagnostic test in arithmetic. 
If you can obtain a specimen set of such a test, examine it carefully, especially 
the manual. How does it differ from the one described here? What makes 
a test diagnostic? 


Other Arithmetic Tests 


1. Arithmetic Tests. Sub-test of Metropolitan Achievement Battery. 1948-50. 
Elementary, Grades 3, 4; Intermediate, Grades 5, 6; Advanced, Grades 7-9. 
Computation and Problem Solving. 

Forms R, S, T (Form T: Intermediate and Advanced only). $2.45 per 35. 
Elementary, 75 minutes; Intermediate and Advanced, 90 minutes each. 
World Book Company. 


2. Arithmetic Tests. Sub-tests of Stanford Achievement Battery. 1953. Elementary, 
Grades 3, 4; Intermediate, Grades 5, 6; Advanced, Grades 7-9. Reasoning 
and Computation. 

Forms J, K, L, M (Forms L and M: Intermediate and Advanced only). Elementary, 
$2.10 per 35; Intermediate and Advanced, $2.60 per 35. Elementary, 
55 minutes; Intermediate and Advanced, 70 minutes each. 

World Book Company. 


3. Basic Arithmetic Skills. Sub-test of Iowa Every-Pupil Tests of Basic Skills. 
1940-43. Elementary, Grades 3-5; Advanced, Grades 5-9. Vocabulary and 
Fundamental Knowledge, Fundamental Operations and Problems. 
Forms L, M, N, O, P (Form P: Advanced only). Elementary, $3.00 per 35; 
Advanced, $3.15 per 35. Elementary, 57 minutes; Advanced, 63-68 minutes. 
Houghton Mifflin Company. 


4. Brief Survey of Arithmetic Skills. 1947. Grades 5-12. Fundamental 
Operations, Fractions, Decimals. 

Forms A, B. $.05 per copy. 10 (15) minutes. 

Educational Records Bureau. 


5. Brueckner Diagnostic Arithmetic Tests. 1926, 1943. Grades 4-8, Grades 
5-8. Whole Numbers, Fractions and Decimals, each in a separate booklet. 

One Form. $1.40-$1.65 per 25. Whole Numbers, 30 minutes; Fractions, 
140 minutes; Decimals, 65 minutes. 

Educational Test Bureau. 


6. California Arithmetic Test. Sub-test of California Achievement Test Battery. 
1951. Primary, Grades 1-4; Elementary, Grades 4-6; Intermediate, 
Grades 7-9; Advanced, Grades 9-14. Arithmetic Reasoning and Arithmetic 
Fundamentals. 

Forms AA, BB, CC, DD. $.07 per copy. 50-74 minutes. 

California Test Bureau. 
7. Diagnostic Tests and Self-Helps. 1955. Grades 3-12. Screening tests in 
Whole Numbers, Fractions, Decimals, and Arithmetic, plus 23 tests covering 
Addition, Subtraction, Multiplication, Division, and all the major areas (Frac- 
tions, Decimals, Per Cent, etc.) diagnostically, with self-helps on the back of 
each test. 

One form. $.02 per test; $.50 per set. Untimed; complete set would require 
several hours. 

California Test Bureau. 


8. Functional Evaluation in Mathematics. 1952. Elementary Level, Grades 
4-6; Upper Level, Grades 7-9. Quantitative Understanding, Problem Solving 
and Basic Computation. 

One form. $1.10 to $1.95 per 25. (The three areas are in the form of separate 
tests at each level — six separate tests in all.) 25 (30) minutes. 

Educational Test Bureau. 


9. Lee-Clark Arithmetic Fundamentals Survey Test. 1944. High school. 
Twenty basic processes of arithmetic. 
Forms A, B. $.05 per copy. 25 minutes. 
California Test Bureau. 


10. Los Angeles Diagnostic Tests: Fundamentals of Arithmetic. 1925-26. 
Grades 2-8. Fundamental Operations with Whole Numbers, Fractions and 
Decimals. 

Forms 1, 2. $.08 per copy. 40 minutes. 

Reasoning in Arithmetic. Grades 3-9. One-step Problems, Two-step Prob- 
lems, Denominate Numbers, Percentage, etc. 

Forms 1, 2. $.08 per copy. 30 minutes in Grades 3-5; 40 minutes in Grades 
6-9. 

California Test Bureau. 


11. National Achievement Tests: Arithmetic Fundamentals and Reasoning. 
1938-48. Grades 3-6. Computation, Number Comparisons, Problem Analysis, 
Problem Solving. Forms A, B. $2.50 per 25. 60 minutes. 

Arithmetic Reasoning. Grades 3-8. Comparisons, Problem Analysis, Prob- 
lem Solving. Forms A, B. $2.50 per 25. 40 minutes. 

Arithmetic Fundamentals and Reasoning. Grades 6-8. Computation, Num- 
ber Comparisons, Problem Analysis, Problem Solving. Forms A, B. $2.50 
per 25. 60 minutes. 

Acorn Publishing Company. 


12. Number Fact Check Test. 1946. Grades 5-8. Addition, Subtraction, 
Multiplication and Division Facts. 

Forms A, B. $.05 per copy, plus $.60 for scoring stencil. 25 minutes. 

California Test Bureau. 


13. Wisconsin Inventory Tests in Arithmetic. Reprinted, 1955. Covers the 
100 first decade combinations in Addition; the 100 fundamental combinations 
in Subtraction and in Multiplication; 76 of the most difficult combinations 
in Short Division; 136 most difficult combinations in Higher Decade Addition; 
175 bridging combinations in Multiplication; 45 combinations which give zero 
quotients in Short Division; 11 difficulty levels of Long Division; 100 combina- 
tions in the Addition of Mixed Numbers; 39 combinations in Subtraction of 
Mixed Numbers; One-Step Problems; Denominate Numbers. 

One form. Tests I to VII, $1.75 per 25; Tests VIII to XII, $2.00 per 25. No 
time limit. Administered visually and orally. 

Public School Publishing Company. 


Spelling 

The process of spelling seems not to be fully understood from a psycho- 
logical point of view. It involves memory, sensory-motor functions includ- 
ing vision, hearing, and muscular coordination, intelligence, phonics, and 
perhaps an indefinable (or at least a not-well-defined) sense of letter and 
word combinations, to name only the more obvious factors. It is intimately 
associated with reading and writing, both of which depend in part on ability 
to spell and at the same time contribute to this ability. Individuals differ 
widely in spelling ability within the same grade, the same I.Q. group, and 
the same age level. 

The measurement of spelling ability raises such questions as what words 
to test, how many, how to choose them, etc. Usually, words for this purpose 
are chosen from lists such as Thorndike’s The Teacher's Word Book,¹⁵ 
a compilation from many sources, of words found most frequently in run- 
ning discourse, classified by frequency into the first, second, third thou- 
sands, etc., up to ten thousand. Other lists of a similar nature are also 
available. The assumption is that an educated person’s needs in spelling 
are related to the frequency with which words are found in English usage, 
and that the more difficult words are found, by and large, in the less fre- 
quently occurring groups. 

In setting up tests of spelling, a random sample of a word list or even a 
dictionary will generally provide a list which covers a wide range of diffi- 
culty and frequency of use. The difficulty can be determined by trying out 
the words with pupils of varying ages and levels of development. After 
difficulty has been determined, the words are arranged in order of difficulty 
in what is usually called a spelling scale. In testing, the words are presented 
in this order, proceeding from easiest to hardest. The scale may be seg- 
mented so that a list of 100 words is divided into a number of overlapping 
groups of perhaps 20 words each. Then the first 20 constitute the test for 
the lowest level, those from 11 to 30 the next level, and so on to the highest 
level, from 81 to 100. 
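
The segmenting scheme just described can be illustrated with a brief sketch. The function name and its parameters are hypothetical; the group size of 20 and the step of 10 follow the example in the text (words 1 to 20, 11 to 30, and so on).

```python
def overlapping_levels(words, size=20, step=10):
    """Split a difficulty-ordered word list into overlapping test levels."""
    return [words[start:start + size]
            for start in range(0, len(words) - size + 1, step)]


# Stand-in for a 100-word spelling scale: positions 1..100, easiest first.
scale = list(range(1, 101))
levels = overlapping_levels(scale)
for number, level in enumerate(levels, start=1):
    print(f"Level {number}: words {level[0]} to {level[-1]}")
```

With 100 words this arrangement yields nine levels, each sharing half of its words with the level below it and half with the level above it (except at the two ends of the scale).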

Various methods are employed to make the testing of spelling ability 
more objective than simply having the person write the word as it is pro- 
nounced. One method is to present groups of four or five different words, 
one of which may be misspelled; another is to present several spellings of 

15 E. L. Thorndike, The Teacher's Word Book (New York: Bureau of Publications, 
Teachers College, Columbia University, 1921). 
the same word, from which the correct one is chosen. A less objective 
though probably more common practice is to pronounce the word, use it 
in a sentence, pronounce it again, and then have the pupils write it. For 
example: 


crowd — There was a large crowd at the game. — crowd 


Since a spelling test was described in the earlier part of this chapter in 
the discussion of survey tests, since spelling tests do not occupy a particu- 
larly important place among standardized tests, and since all spelling tests 
are essentially the same, at least for elementary grades, no further de- 
tailed description of such tests will be given. The following list may be 
useful for reference purposes. 


Spelling Tests 


1. Davis-Schrammel Spelling Test. 1935. Grades 1-9. Words chosen from 
well-known sources; 20 words per grade. 

Forms A, B, C, D. $.30 per copy of four forms. (None needed by pupils.) 
Untimed: about 15 minutes per form. 

Bureau of Educational Measurements. 


2. Gates-Russell Spelling Diagnostic Tests. 1937. Grades 2-6. Spelling Words 
Orally; Word Pronunciation; Giving Letters for Letter Sounds; Spelling One 
Syllable; Spelling Two Syllables; Word Reversals; Spelling Attack; Auditory 
Discrimination; Visual, Auditory, Kinesthetic and Combined Spelling Methods. 

One form. $.75 per copy of teacher's manual and test booklet. $1.95 per 35 
test booklets. Untimed. Individual. 

Bureau of Publications. 


3. Guy Spelling Scale. Grades 2-9. 
Forms 1, 2, 3. $.25 per copy. (None needed by pupils.) 25 minutes. 
Public School Publishing Company. 


4. The New Iowa Spelling Scale. Undated. Grades 2-8. Contains 5507 
words chosen from writing of children and adults or from the Thorndike-Lorge 
Teachers Word Book. All words are said to be of high social utility. 

One form. $.50 per copy. 

Bureau of Educational Research and Service. 


5. Lincoln Diagnostic Spelling Tests. 1942-48. Intermediate, Grades 5-8; 
Advanced, Grades 8-12. Pronunciation, Enunciation, and Use of Rules in 
Spelling. 

Forms A, B. $2.25 per 25. 50 minutes. 

Public School Publishing Company. 


6. Morrison-McCall Spelling Scale. 1922. Grades 2-8. Eight lists of 50 words 
each ranging from easy to difficult. 

One form. $.25 per copy. (None needed by pupils.) Untimed (about 15 
minutes). 

World Book Company. 
7. Spelling Test. Sub-test of Coordinated Scales of Attainment. 1946. Grades 
5-8. Forty words drawn from children’s letters and checked against spelling 
books. 

Forms A, B. $1.50 per 25, plus $.75 per 25 answer sheets. 40 minutes. 

Educational Test Bureau. 


8. Spelling Tests. National Achievement Test Series. 1939. Grades 3-4, 
50 words; Grades 5-8, 60 words. 

Forms A, B. $1.00 per 25. 25 minutes. 

Acorn Publishing Company. 


Learning Exercises 


9. What is the correlation or relationship between spelling ability and I.Q.? Can 
you find any reports of studies that throw light on this question? 
10. Is it true that some people who seem otherwise competent, in and out of 
school, have difficulty with spelling? If so, why? Can you find any evidence on this 
question? 


Language Arts 


The concept of the language arts generally includes language skills other 
than reading, handwriting, and spelling, although the latter is often included 
in tests labeled “language arts.” Broadly speaking, the concept may also 
include oral expression and listening comprehension as expressive and 
receptive language arts. 

Aside from reading and writing, such attainments as spelling, sentence 
sense, capitalization, punctuation, grammar, and general usage are basic to 
good written expression and to oral communication as well. They are es- 
sential not only from the standpoint of good taste and established standards 
in speaking and writing, but also for clear and accurate comprehension and 
understanding of what is communicated. This is so obvious that it needs 
no elaboration. 

Most of the tests in the area of language arts are sub-tests of a battery; 
few such tests for the elementary grades have been published separately. 
Although a language arts test has been described as part of a survey battery 
earlier in the chapter, the one to be described in this section is somewhat 
different in its content and approach, and it is available as a separate 
test. 


Iowa Every-Pupil Tests of Basic Skills: Basic Language Skills 


1. Name of test and authors. Basic Language Skills.¹⁶ H. F. Spitzer and 
Ruth Fridell, University of Iowa. (Part of series: Iowa Every-Pupil Tests of 
Basic Skills.) 

16 Quotations in test description by permission of the publisher. 


2. Nature and purposes. A test of basic skills in punctuation, capitalization, 
usage, spelling, and sentence sense. The purpose is to measure the ability 
to apply knowledge in these areas, “in recognizing what is right and wrong in 
actual usage.” The tests are intended primarily for diagnostic testing and 
remedial instruction. 


3. Grade level. Grades 3, 4, and 5. 
4. Number of forms. L, M, N, and O. 


5. Publisher and date of publication. Houghton Mifflin Company. 
Test, 1943; Manual, 1945; Manual of General Information, 1947. 


6. Cost. $3.00 per 35 tests and answer booklets. 


7. Content. Punctuation. Consists of several exercises from which all 
punctuation has been omitted. The pupil writes in the punctuation he thinks 
necessary. 

Sample: Jacks hat is large 


Capitalization. Consists of several selections from which capital letters have 
been omitted. The pupil is told to draw a line through each letter that he 
thinks ought to be a capital. 


Sample: which girl’s name is Mary? 
Usage. Consists of 50 alternate-response items, each containing a correct 
and an incorrect usage. The pupil chooses the correct one. 
Sample: The marbles (is / are) large 


Spelling. Consists of 40 multiple-choice items in which a word is used in a 
sentence, four possible spellings of the word being given. The pupil is to select 
the correct spelling. 

Sample: There is my ______ (buk, boke, ...) 


Sentence Sense. Consists of 42 exercises which are to be designated as good 
sentences or not good sentences. 


Sample: John is a boy.    R  W 
        The big dog       R  W 


The items of the test are said to cover the “more tangible of the skills which 
contribute to expression in written language.” The skills in each of the five 
parts of the test are listed in detail so that a teacher can know just what skills 
or rules each part of the test is intended to measure. 


8. Time required. Part I, Punctuation, 11 minutes; Part II, Capitali- 
zation, 8 minutes; Part III, Usage, 13 minutes; Part IV, Spelling, 8 minutes; 
Part V, Sentence Sense, 6 minutes. Total: 46 minutes, plus allowance for 
passing out and collecting papers and reading preliminary directions and directions 
for each part (probably 15 minutes). The test may be given in two or 
more sittings. 


9. Directions for, and ease of, administering. The directions are clear, 
complete, and concise. The test is easy to administer if the examiner will follow 
directions. 

10. Validity. The tests are said to be based upon analysis of courses of 
study, textbooks, and instructional procedures. The identification and analysis 
of skills tested are based upon research by the authors and others. Items were 
selected on the basis of cruciality and discriminating power determined in 
preliminary tryouts with samples of from 300 to 500 pupils. All tests are 
based on pooled judgments of the authors. 


11. Reliability. Reliability coefficients are based on odd-even correlations 
and use of the Spearman-Brown Formula. For the separate parts, based on 
results of a fourth-grade sample drawn from 15 representative schools, the cor- 
relation coefficients are: 


                  L     M     N     O 

Punctuation      .83   .83   .82   .85 
Capitalization   .87   .89   .88   .88 
Usage            .68   .76   .75   .76 
Spelling         [?]   [?]   .88   [?] 
Sentence Sense   [?]   .73   .86   .84 
Total            .94   .96   .96   .95 
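
The odd-even procedure described under Reliability can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' own computation: the item scores are invented, and the function names (pearson_r, spearman_brown, odd_even_reliability) are hypothetical. Each pupil's odd-numbered and even-numbered items are scored separately, the two half scores are correlated, and the Spearman-Brown Formula, r_full = 2r / (1 + r), is applied to estimate the reliability of the full-length test.

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r_half):
    """Step a half-test correlation up to full-test length."""
    return 2 * r_half / (1 + r_half)


def odd_even_reliability(item_scores):
    """item_scores: one row per pupil, 1 = item right, 0 = item wrong."""
    # Index 0, 2, 4, ... holds items 1, 3, 5, ... (the odd-numbered items).
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    return spearman_brown(pearson_r(odd, even))


# Invented scores for five pupils on a six-item test (illustration only).
demo = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(round(odd_even_reliability(demo), 2))
```

Because the half-test correlation understates the reliability of a test twice as long, the stepped-up coefficient is always at least as large as the half-test correlation when that correlation is positive.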


12. Manual. A separate manual accompanies each sub-test of the battery. 
These manuals give complete directions for administering and scoring the tests 
and for interpreting the scores, together with suggestions for follow-up work. 
In addition, there is a Manual of General Information which gives detailed 
information concerning the nature and purposes of the battery, the norms, 
reliability, and validity, and the use and interpretation of test results. 


13. Scoring. Directions for scoring are complete and very detailed. A 
step-by-step procedure has been worked out for division of labor in scoring. 
This may be followed by one person doing the entire job or it may be readily 
adapted to scoring with several persons participating. Scoring is done by use of 
cardboard stencils. The score on each part is the number right except on 
Part I, Punctuation, where the number of superfluous marks is subtracted from 
the number of correct ones to obtain the score. The total possible score is 225. 

The scoring of Parts I and II where the pupil inserts punctuation marks or 
indicates letters that should be capitalized is more subjective and difficult than 
the scoring of the other parts of the test. 
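
The scoring rule for Part I, Punctuation, can be illustrated with a short sketch. It rests on assumptions that are not in the manual as quoted here: marks are encoded as (word position, mark) pairs, any mark the pupil writes that is not in the key is counted as superfluous, and the score is not prevented from falling below zero. The function name punctuation_score is hypothetical.

```python
def punctuation_score(key, response):
    """Return (correct, superfluous, score) for one pupil on one exercise.

    key and response are sets of (word position, mark) pairs.
    Assumption: every mark not in the key counts as superfluous.
    """
    correct = len(response & key)
    superfluous = len(response - key)
    return correct, superfluous, correct - superfluous


# Sample exercise from the description: "Jacks hat is large".
# The key calls for an apostrophe in the first word and a final period.
KEY = {(0, "'"), (3, ".")}

# A hypothetical pupil supplies both marks and adds an unneeded comma.
PUPIL = {(0, "'"), (1, ","), (3, ".")}

print(punctuation_score(KEY, PUPIL))
```

Under these assumptions the pupil earns two correct marks and loses one for the extra comma, which shows why Part I takes more judgment to score than the parts scored simply as number right.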


14. Norms. Grade equivalents are given for total scores and for each part. 
Percentile norms for each grade and for each part of the test are also provided. 
In addition, there are age equivalents for grade equivalents so that raw scores 
can be interpreted in terms of both. The age equivalent gives the age of the 
typical pupil who makes a given score on the test. These norms are based on 
nearly 200,000 pupils in Grades 3 to 8, inclusive, located in more than 350 different 
school systems of Iowa, Illinois, Missouri, Minnesota, Nebraska, Wisconsin, 
Oklahoma, Arizona, New York, New Mexico, Montana, and North and South 
Dakota. More than two-thirds came from Iowa. 

In addition, age-at-grade norms are given for total scores on the test. These 
have already been discussed in connection with the Stanford Achievement Battery. 
They give average performance of pupils who are neither retarded nor 
accelerated. 

An individual cumulative record form may be ordered separately. The re- 
sults of six testings with the entire battery may be plotted as profiles for the 
individual pupil on this form. The profiles are based on grade equivalents for 
each part of every test. 


15. Format. Paper, printing, and arrangement on the page are good. 
Printing is large and clear enough for pupils of the grade levels for which the 
test is intended. 


Learning Exercise 


11. Should speech be included in the language arts? If so, what aspects of this 
subject could be measured with paper-and-pencil tests? What kinds of measuring 
techniques would be suitable, for example, in evaluating a debate as a speech ac- 
tivity? 


Other Tests of Language Arts 


1. California Language Test. Sub-test of California Achievement Tests. 1950. 
Grades: Primary, 1-4.5; Elementary, 4-6; Intermediate, 7-9; Advanced, 9-14. 
Primary: Capitalization, Punctuation, Spelling. Elementary: Capitalization, 
Punctuation, Usage, Sentence Sense, Spelling. Intermediate and Advanced: 
same as Elementary, plus Parts of Speech. 

Forms AA, BB, CC, and DD. Primary, $.05 per copy. Elementary and 
Intermediate, $.06 per copy; Advanced, $.07 per copy. Specimen set, any one 
level, $.50. Actual working time: 16-28 minutes. 

California Test Bureau. 


2. Iowa Language Abilities Test. 1948. Elementary, Grades 4-7; Intermediate, 
Grades 7-10. Spelling, Word Meaning, Usage, Capitalization, and 
Punctuation. Intermediate also includes Sentence Sense and Grammatical 
Form Recognition. 

Forms A and B. Elementary, $3.50 per 35; Intermediate, $4.30 per 35. 
48 minutes. 

World Book Company. 


3. Language Essentials Test. 1941. Grades 4-8; Punctuation, Capitaliza- 
tion, Sentence Structure, Correct Usage. 

Forms A and B. $1.40 per 25. 30 minutes. 

Educational Test Bureau. 


4. National Achievement Tests: English. 1944. Short Form, Grades 3-6; 
Usage, Punctuation, Capitalization, and Power to Express Ideas. 30 minutes. 
$1.50 per 25. Inclusive Form, Grades 3-8; 40 minutes. $2.00 per 25. Short 
Form, Grades 6-8; Usage, Punctuation, Capitalization, and Power to Express 
Ideas. 30 minutes. $1.50 per 25. 

Forms A and B, all tests. 

Acorn Publishing Company. 


MEASUREMENT IN SOCIAL STUDIES AND SCIENCE 


In addition to the fundamental tool-subjects taught in all elementary 
grades, there is always some instruction in social studies and, to a growing 
extent, in natural science. Social studies has long been an important part 
of the elementary curriculum, but elementary science, as it is understood 
today, is relatively new — at least under that name. Its predecessor, 
nature study, had a long period of usefulness, but has now been largely replaced 
by a more systematic and scientific type of instruction concerned 
with helping the child to understand the natural world, and giving him an 
understanding of the methods by which science has advanced and improved 
human living. 


Social Studies 


Recent emphases in the social studies have been on (1) world geography 
and a world point of view, (2) social studies as a means of interpreting the 
past and understanding current events, (3) citizenship in a democracy, its 
responsibilities as well as its privileges, and (4) the scientific study of 
man and his civilization. As in the case of the other subjects already dis- 
cussed, these trends represent departures from the history and geography 
that emphasized learning facts about dates, discoveries, battles, boundaries, 
capitals, seaports, exports, imports, and memorizing speeches, parts of 
the branches of government, etc. Not that a knowledge of such matters is 
not important; it is, of course, necessary to know facts if one is to learn to 
think in an area such as social studies or in any other branch of instruction. 
Facts are the basis of thinking and of ideas. But it is most important that 
the process does not stop with the learning of facts. To be useful, such 
learning must proceed to the interpretation, integration, and application 
of these facts. Of what value is it if a pupil knows many facts about the 
organization of government, has memorized the Constitution and the Bill of 
Rights, but fails to vote at elections, or violates one of the articles in the Bill 
of Rights without ever relating his actions to the meaning of that article? 

Present-day textbooks and courses of study reflect the trends listed 
above to a greater or lesser degree. Certainly no modern textbook in his- 
tory, geography, or civics ignores them. Today’s pupils seem more and 
more to get out of the classroom and into the community to see how things 
actually work. The greater mobility of our age makes geography and history 
much more real and meaningful. Standardized tests reflect to some 
extent the trends in social studies, although tests tend to be somewhat more 
conservative than the best teaching, as has been noted in the case of other 
areas. The test to be described as an example is one of the more up-to-date 
and progressive ones in this field. 


California Tests in Social and Related Sciences: Social Sciences 


1. Names of tests and authors. California Tests in Social and Related 
Sciences: Social Studies I and Social Studies II.¹⁷ Georgia Sachs Adams and 
John A. Sexson. 


2. Nature and purposes. Social Studies I includes Test 1, The American 
Heritage, and Test 2, People of Other Lands and Times. Social Studies II includes 
Test 3, Geography, and Test 4, Basic Social Processes. 

Test 1 deals with (A) Exploration and Colonization of America, (B) The Westward 
Movement, (C) Later Development of the Nation, and (D) Understanding 
of Democracy. 

Test 2 deals with (A) People of Other Lands (Latin America, the Orient, and 
European countries commonly studied in elementary grades), and (B) People 
of Other Times (early civilizations of China, Egypt, and Greece), with emphasis 
on their contributions to the culture of today. 

Test 3 deals with (A) Geography of the U.S., (B) World Geography, (C) Map 
Reading and Knowledge of Geographical Terms, and (D) Effects of Geography 
on the Life of Man. 

Test 4 deals with (A) Food, Clothing, and Shelter, and (B) Transportation and 
Communication. 


3. Grade level. The Elementary Test is designed for use in Grades 4-8. 
4. Forms. AA and BB. 
5. Publisher and date. California Test Bureau. 1946-1953. 


6. Cost. Part I, Tests 1 and 2, or Part II, Tests 3 and 4: $.08 per copy. 
Scoreze for either part: $.07 per copy. Machine- or hand-scorable answer 
sheets: $.04 per copy. 

7. Content. Test 1, Section A, consists of 22 multiple-choice items. 

Sample: The first president of the United States was 


a Lincoln b Washington 
c Jackson d Wilson 


Test 1, Section B, consists of 8 true-false items and 20 multiple-choice items. 


Samples: Many of the pioneers built small houses or 
shelters on their flat boats. T F 

When the pioneers camped at night on 
their trail to the West, they arranged their 
wagons in a circle. Their chief reason for 
doing this was to provide 

  a a central place for eating, singing, and dancing 
  b better protection against Indians 
  c heat from a central fire 
  d protection against rain 

17 Quotations in test description by permission of the publisher. 

Test 1, Section C, consists of 5 true-false items, and 15 multiple-choice items 
dealing with the Civil War and more recent events, including World War I. 

Test 1, Section D, consists of 12 true-false items, 7 multiple-choice items 
similar to the samples already given, but dealing with democracy and its prin- 
ciples. In addition to these 19 items, there are 6 items to be marked U.S., if the 
item is more often found in the United States; D, if found more often in dic- 
tatorships; and O, if the pupil thinks it is found in neither. 


Sample: Studying the laws of a city in order to suggest 
improvements ______ 


Test 2, Section A, consists of 14 true-false and 16 multiple-choice items 
dealing with life and conditions in other countries today. 

Test 2, Section B, consists of 10 true-false and 10 multiple-choice items deal- 
ing with the civilizations of ancient times and the Middle Ages. 

Test 3, Section A, consists of 35 multiple-choice items dealing with geography 
of the United States. The last 15 items are based on a map of the United 
States, and deal with the location of large cities, certain states, and some impor- 
tant bodies of water. 

Test 3, Section B, consists of 6 true-false and 24 multiple-choice items dealing 
primarily with geography of regions other than the United States. The last 12 
of the multiple-choice items are based on a Mercator projection of the world, 
and test for knowledge of locations of large cities, certain countries, and bodies 
of water. 

Test 3, Section C, consists of 20 three-response multiple-choice items testing 
knowledge of terms such as weather, longitude, etc., and ability to read a map 
with symbols for capital cities, distances, etc. 

Test 3, Section D, consists of 20 three- or four-response multiple-choice items 
testing knowledge and understanding of the effects of altitude, latitude, etc., on 
climate, crops, location of cities, etc. The first six items are based on the same 
map used in Section C. 

Test 4, Section A, is made up of 8 true-false and 20 multiple-choice items 
dealing with food, how and where it is grown, and clothing and shelter under 
different conditions and times. 

Test 4, Section B, includes 18 true-false and 14 multiple-choice items on 
transportation and communication in modern times, and their importance and 
relationships. 

In all, Test 1 consists of 25 true-false and 64 multiple-choice items; Test 2, of 
24 true-false and 26 multiple-choice items; Test 3, of 6 true-false and 99 mul- 
tiple-choice items; Test 4, of 26 true-false and 34 multiple-choice items. 
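
The totals just given can be checked against the section-by-section counts with a short tally. The sketch below uses only figures stated in the description; the names SECTIONS and item_totals are hypothetical. The six items in Test 1, Section D, that are marked U.S., D, or O are neither true-false nor multiple-choice and are therefore left out of both counts.

```python
# (true-false, multiple-choice) item counts for each section,
# copied from the content description above.
SECTIONS = {
    "Test 1": {"A": (0, 22), "B": (8, 20), "C": (5, 15), "D": (12, 7)},
    "Test 2": {"A": (14, 16), "B": (10, 10)},
    "Test 3": {"A": (0, 35), "B": (6, 24), "C": (0, 20), "D": (0, 20)},
    "Test 4": {"A": (8, 20), "B": (18, 14)},
}


def item_totals(test_name):
    """Return (true-false total, multiple-choice total) for one test."""
    counts = list(SECTIONS[test_name].values())
    true_false = sum(tf for tf, _ in counts)
    multiple_choice = sum(mc for _, mc in counts)
    return true_false, multiple_choice


for name in SECTIONS:
    print(name, item_totals(name))
```

The tally agrees with the totals stated for all four tests.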


8. Time required. If answers are marked on test booklets: Test 1, 33 
minutes; Test 2, 17 minutes; Test 3, 32 minutes; Test 4, 18 minutes. When 
Scoreze or separate answer sheets are used: Test 1, 40 minutes; Test 2, 20 minutes; 
Test 3, 38 minutes; Test 4, 22 minutes. These are suggested actual working 
times, and do not include time for passing out and collecting materials, 
reading directions, or answering questions. 


9. Directions for, and ease of, administering. Directions are complete 
and clear. 


10. Validity. Evidence of the validity of the tests rests on three bases: 


a. Analysis of courses of study and textbooks for content to be 
used in making preliminary test items 

b. Try-out and statistical analysis of preliminary items 

c. Rating of items by teachers and supervisors as to degree of impor- 
tance of the information or concept tested 


Items were selected for the tests on the bases of statistical criteria and ratings. 


11. Reliability. Reliabilities (Kuder-Richardson) were computed for each 
grade, 4-8, and for each part of each test. They are given in terms of correlation 
coefficients and standard errors of measurement. The reliability coefficients 
are generally in the eighties, and, in the seventh and eighth grades, mostly in 
the nineties. Considering the restricted range, the method used for calculating 
the correlation coefficient, and the standard errors of measurement, these values 
are quite satisfactory for scores on the four tests. The user is cautioned not 
to place much reliance on scores on the sub-tests or parts of each test, since 
their reliabilities are often below the level necessary for use in individual diag- 
nosis. 
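
A reliability of the Kuder-Richardson type, and the standard error of measurement that goes with it, can be computed as in the sketch below. The description says only "Kuder-Richardson" and does not say which formula the publisher used; the sketch assumes Formula 20, which requires items scored right or wrong. The data are invented and the function names are hypothetical.

```python
from statistics import pvariance


def kr20(item_scores):
    """Kuder-Richardson Formula 20 (assumed here; the source names no formula).

    item_scores: one row per pupil, 1 = item right, 0 = item wrong.
    """
    n_pupils = len(item_scores)
    k = len(item_scores[0])
    total_scores = [sum(row) for row in item_scores]
    var_total = pvariance(total_scores)
    if var_total == 0:
        raise ValueError("all pupils have the same total score")
    # Sum of p * q over items, where p is the proportion passing the item.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n_pupils
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r), in raw-score units."""
    return sd * (1 - reliability) ** 0.5


# Invented scores for four pupils on a three-item test (illustration only).
demo = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(demo))
print(standard_error_of_measurement(10, 0.84))
```

The standard error of measurement is expressed in score units, which is why it is helpful to report it beside the coefficient: a short sub-test can have a small range of scores and still carry an error large enough to make individual diagnosis doubtful.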


12. Manual. The manual for the tests in Social and Related Sciences is 
excellent. It provides the information needed to administer and score the tests, 
and to interpret the results. This includes the completion, on each test booklet, 
of a diagnostic profile for that pupil, based on the scores on each of twelve parts 
or sub-tests of the four tests in Social Science. Suggestions are given in the 
manual for use of the test results in appraising needs of an entire class or grade, 
and the needs of pupils transferring from another school. 


13. Scoring. The tests may be used and scored in three ways. If the pupil 
marks his answers on the test booklet, the scoring is done by hand with a printed 
strip scoring key. Regular machine-scorable answer sheets may be used and 
scored by machine or by hand. The third method is Scoreze. Scoreze consists 
of a double answer sheet with a carbon sheet in between. The pupil marks 
his answers on the top sheet. The second sheet shows the correct answers. 
As the pupil marks his answers on the top sheet, the carbon records his marks 
on the second answer sheet. After the test is completed, the sheets are sepa- 
rated and the pupil’s marks on the second sheet are compared with the correct 
answers. Scoring is simply a matter of counting the correct answers. 

The score throughout is the number right. No correction is made for chance, 
even though true-false and some three-response multiple-choice items are used. 
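
For comparison, the conventional correction for chance, which these tests do not apply, subtracts a fraction of the wrong answers from the number right: S = R - W / (k - 1), where k is the number of choices per item. The sketch below is illustrative only; the function name and the sample figures are hypothetical, and omitted items are assumed to count as neither right nor wrong.

```python
def corrected_for_chance(right, wrong, choices):
    """Conventional correction for guessing: S = R - W / (k - 1).

    right, wrong: counts of right and wrong answers (omits excluded).
    choices: number of alternatives per item (2 for true-false).
    """
    if choices < 2:
        raise ValueError("an item needs at least two choices")
    return right - wrong / (choices - 1)


# A hypothetical pupil with 20 right and 10 wrong on each item type.
print(corrected_for_chance(20, 10, 2))  # true-false items
print(corrected_for_chance(20, 10, 3))  # three-response items
print(corrected_for_chance(20, 10, 4))  # four-response items
```

The fewer the choices, the heavier the penalty, which is why the absence of a correction matters most for the true-false and three-response items.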


14. Norms. The normative population for the tests came from 11 states in 
the northern half of the United States. The population was selected to give a 
median I.Q. of 100 with a standard deviation of 16; 70 per cent of the pupils 
were making normal progress, 20 per cent were retarded a half year or more, 
and 10 per cent were accelerated a half year or more. About 85 per cent were 
white; the remainder consisted of Mexican, Negro, and other minority groups. 

Percentile norms are provided for each sub-test or part of the four tests in 
Social Science. There are also grade equivalents and age norms for each of the 
four tests. 


15. Format. The physical appearance, printing, arrangement, and general 
organization of these tests leave little to be desired. The booklets are attractive 
and convenient in size. Accessories include a class record sheet. 


Learning Exercises 


12. Good citizenship is a generally accepted goal of social studies instruction. 
Do you know of, or can you find, any tests of citizenship? If so, describe them. 
13. Describe several different approaches to the evaluation of international 
understanding. 


Other Tests in Social Studies 


1. American History. Sub-test of Coordinated Scales of Attainment. 1946-1950. 
Grades 5-8, with separate tests for each grade. U.S. history and government; 
world history. 

Forms A and B. $1.50 per 25. Answer sheets, $.75 per 25. 15 minutes. 


Educational Test Bureau. 


2. Emporia Geography Test. 1937. Grades 4-7. Part I: Knowledge of U.S. 
geography, tested by reference to a map. Part II: 60 true-false questions on 
world geography. Part III: 40 multiple-choice questions on world geography. 

Forms A and B. $1.20 per 25. 30 minutes. 

Bureau of Educational Measurement. 


3. Emporia United States History Tests. 1937. Test I, Grades 5 and 6; 
Test II, Grades 7 and 8. Test I consists of 34 true-false and 26 multiple-choice 
questions on U.S. history. Test II consists of 64 true-false, 25 multiple-choice, 
a set of 26 matching, and five historical sequence items. 

Forms A and B. $1.20 per 25. 40 minutes. 

Bureau of Educational Measurement. 


4. Fourth Grade Geography Test. (National Council of Geography Teachers 
Geography Test) 1940. Grade 4. Knowledge of climate, main geographical 
features of Europe and Africa, and human adaptations to physical conditions. 

Form A. $.08 per copy. 30 minutes. 

McKnight and McKnight. 


5. Geography. Sub-test of Coordinated Scales of Attainment. 1946-50. 
Grades 6-8, with separate test for each grade. World geography; U.S. geog- 
raphy; climate and weather; natural resources and manufactured products; 
map reading. 

Forms A and B. $1.50 per 25. Answer sheets, $.75 per 25. 15 minutes. 

Educational Test Bureau. 


6. National Achievement Tests: Geography (Short Form). 1938-39. Grades 
3-6 and 6-8. Geographic ideas and comparisons, and geographic facts. 

Forms A and B. $1.50 per 25. 20 minutes. 

Acorn Publishing Company. 


7. National Achievement Tests: Geography. 1939. Grades 6-8. Understanding 
of geographic concepts; products, their values, uses; knowledge of 
important locations; appreciation of economic and human relations; ability to 
understand life situations; miscellaneous problems. 

Forms A and B. $2.50 per 25. 35 minutes. 

Acorn Publishing Company. 


8. Social Studies. Sub-test of Stanford Achievement Test. Grades 5-9. U.S. 
and world history; geography and civics. A selection from social studies tests in 
Intermediate and Advanced Batteries. 

Forms Jm, Km, and Lm. $3.15 per 35. Answer sheets, $1.20 per 35. 

World Book Company. 


Natural Science 


As in the case of social studies, not many standardized tests in elementary 
science are available. This is probably due in large part to the relatively 
recent emergence of natural science as a branch of instruction in the 
elementary grades. It was mentioned earlier that nature study was for 
many years the subject taught as science at the elementary level. Nature 
study was inclined to be unsystematic, rather irregular, and unorganized, 
and within the last twenty-five years it has been largely replaced by elementary 
science. This newcomer, however, has not yet reached the status 
of history, geography, or any of the Three R’s in the curriculum. Many 
elementary school teachers are insecure in the field of science and try to 
avoid teaching it, but the development of well-planned, well-written, at- 
tractive, scientifically sound textbooks written for children has done much 
to help elementary science progress toward the status of a regular, accepted 
member of the family of elementary school subjects. It is entirely appropri- 
ate that progress should be made in this area, for the pupil's education and 
his understanding of the world about him is surely incomplete without at 
least an elementary knowledge of science and the scientific method, and 
the elementary school is the proper place to lay the foundation for such 
knowledge and understanding. 

Since elementary science is new and not completely established in the 
curriculum, few standardized tests have been developed. This is not en- 
tirely surprising, for not all elementary schools offer science; it is not widely 
taught in every grade, or as a regular subject where it is taught. Rather, 
it is still offered in many schools on an incidental and irregular basis. 
Under such conditions, it is very difficult to develop tests that will meet 
with the approval and fit the needs of many teachers. The content must 
be either quite general and therefore weak, or extremely limited, in order 
to satisfy any substantial number of situations. Nevertheless, a few standardized 
tests are on the market. Most of them are sub-tests or tests drawn 
from parts of elementary batteries, but a few are published independently. 
Part III of the California Tests in Social and Related Sciences is described 
below. Parts I and II have been described in the preceding section. 


California Tests in Social and Related Sciences: Related Sciences 


1. Names of test and authors. California Tests in Social and Related 
Sciences.¹⁸ Georgia Sachs Adams and John A. Sexson. 

2. Nature and purposes. Related Sciences includes Test 5, Health and 
Safety, and Test 6, Elementary Science. 

Test 5 deals with (A) Eating for Health, (B) Other Health Information, and 
(C) Safety Information. 

Test 6 deals with (A) The World About Us and (B) Man's Conquest of 
Nature. 


3. Grade Level. The Elementary Test is designed for use in Grades 4-8. 


4. Forms. AA and BB. 
5. Publisher and date. California Test Bureau. 1946-1953. 


6. Cost. Test booklets, $.08 per copy. Scoreze, $.07 per copy. Machine- 
or hand-scorable answer sheets, $.04 per copy. 
7. Content. Test 5, Section A, consists of 9 true-false and 16 multiple- 
choice items. 
Samples: A camel makes long journeys on the desert 
without stopping for water. T F 
One of the best sources of calcium is 


a white bread b milk 
c oatmeal d raisin 


Test 5, Section B, includes 13 true-false and 17 multiple-choice items dealing 
with teeth, disease, health habits, alcohol, and tobacco. 
Test 5, Section C, consists of 13 true-false and 7 multiple-choice items 
dealing with safety practices and first aid. 

Test 6, Section A, consists of 15 true-false and 25 multiple-choice items 
measuring knowledge and understanding about plant and animal life, and the 
solar system. 

Test 6, Section B, is made up of 10 true-false and 15 multiple-choice items 
concerned with man’s early efforts to control nature, and modern means of 
conserving and developing resources and power. 

8. Time required. If answers are marked on test booklets: Test 5, 21 
minutes; Test 6, 19 minutes. If Scoreze or separate answer sheets are used: 
Test 5, 27 minutes; Test 6, 23 minutes. These suggested times are actual work- 
ing times only. 

18 Quotations in test description by permission of the publisher. 

9. Directions for, and ease of, administering. Directions are complete 
and clear. 
10. Validity. As discussed under Social Sciences test, page 198. 


11. Reliability. See page 198. 
12. Manual. See page 198. 
13. Scoring. See page 198. 
14. Norms. See pages 198-9. 
15. Format. See page 199. 


Learning Exercises 


14. There are few standardized tests of elementary science. How do you account 
for this? 

15. Can children in the primary grades learn to solve problems by use of the scientific 
method? If your answer is “yes,” give some examples. If “no,” justify your 
answer. How would you test for this? 


Other Tests in Science 


1. Elementary Science. Sub-test of Coordinated Scales of Attainment. 1946. 
Grades 5-8. Contains items dealing with astronomy, physiology, earth science, 
physics, and, to a limited extent, biological science. 

Forms A and B. $1.50 per 25; answer sheets, $.75 per 25. 15 minutes. 

Educational Test Bureau. 

2. Science Test. Sub-test of Stanford Achievement Test. 1952. Grades 5-9. 
Content selected from science tests of intermediate and advanced complete batteries. 
Includes a selection of items from physical sciences and biological 
sciences. 

Forms Jm, Km, and Lm. $2.60 per 35. Answer sheets, $1.20 per 35. 

World Book Company. 


3. National Achievement Tests: Elementary Science. 1953. Grades 4-6. 
Practical applications, cause and effect relationships, miscellaneous facts in 
science. 

Form A. $2.50 per 25. 30 minutes. 

Acorn Publishing Company. 


Annotated Bibliography 


1. Buros, Oscar K. (ed.). The Fourth Mental Measurements Yearbook. Highland 
Park, N.J.: The Gryphon Press, 1953. (See also The Third Mental Measurements 
Yearbook and The Nineteen Forty Mental Measurements Yearbook.) The most complete 
and useful review of measuring instruments and books on measurement in 
education and psychology. Reviews give factual information as well as reviewers' 
criticisms. Reviews in earlier editions are cited in each of the subsequent Yearbooks. 


2. Committees on Test Standards of the American Educational Research Asso- 
ciation and the National Council on Measurements Used in Education. Technical 
Recommendations for Achievement Tests. Washington, D.C.: American Educational 
Research Association, 1955. An authoritative statement concerning the kinds of 
information test manuals should supply. Written primarily with test authors and 
publishers in mind, but a useful reference for students. 


3. Greene, Harry A., Jorgensen, Albert N., and Gerberich, J. Raymond. Meas- 
urement and Evaluation in the Elementary School, Second Edition. New York: 
Longmans, Green and Company, 1953. Chapters 15-22. The most complete dis- 
cussion, in textbooks of this kind, of achievement tests for elementary schools. Ap- 
proximately one-third of the volume is devoted to a discussion of objectives and 
measurement procedures in each of the common branches of instruction. Reference 
is made almost wholly to published, standardized tests and the problems of developing 
such instruments in each subject. 


4. Jordan, A. M. Measurement in Education. New York: McGraw-Hill Book 
Company, Inc., 1953. Chapters 5-13. Contains nine chapters dealing with meas- 
urement in reading, spelling and handwriting, language and literature, social sci- 
ences, foreign languages, mathematics, science, business education, fine arts, manual 
arts, physical education, and health, respectively. Each chapter discusses ob- 
jectives and describes tests for elementary grades and high schools in each of the 
respective fields. 


5. The Measurement of Understanding. Forty-Fifth Yearbook of the National 
Society for the Study of Education, Part I. Chicago, Ill.: The University of Chicago 
Press, 1946. 338 pp. Describes and gives excerpts from many standardized tests in 
social studies, science, mathematics, language arts, fine arts, health education, 
physical education, home economics, agriculture, technical education, and industrial 
arts. Also gives many helpful suggestions for the improvement of locally made 
tests in these areas. 


6. Torgerson, Theodore L., and Adams, Georgia Sachs. Measurement and Evaluation 
for the Elementary School Teacher. New York: The Dryden Press, 1954. Chapters 
11-16. Contains six chapters dealing with measurement in reading, oral and 
written communication, handwriting and spelling, arithmetic, social studies and 
science, and fine arts at the elementary grade level. In each area there is a discussion 
of methods of teaching and diagnosis, and a discussion of methods of measurement 
and evaluation. 

7. Wrightstone, J. Wayne, Justman, Joseph, and Robbins, Irving. Evaluation 
in Modern Education. New York: American Book Company, 1956. While the book 
is devoted mainly to evaluative procedures, it contains one chapter dealing very 
briefly with measurement of achievement in social studies, natural sciences, music 
and art, foreign languages, industrial arts and business education, and one devoted 
to measurement in language arts and mathematics. 


See also catalogs of the test publishers listed in Appendix B. 
Measuring Achievement in the Secondary Grades 


The problems of producing standardized achievement tests for secondary 
school classes are somewhat different from those encountered by the maker 
of tests for the elementary grades. In the first place, there is not such a 
well-established core in the high school program as there is at the elemen- 
tary level. The most common areas of study in high school are English, 
social studies, science, and mathematics. Practically all high school pupils 
take courses in English and social studies, but content and emphases vary 
widely. For instance, social studies courses may include anything from 
civics to ancient history, and may be taught in a year’s time or over a period 
of four years. About half the secondary school population takes a course in 
science, and while most pupils take some mathematics in high school, many 
graduate without taking any. 

No other academic subjects in the high school curriculum are as common 
as those mentioned, yet there is a great variety of other subjects offered in 
most schools; all of these subjects provide an opportunity to produce many 
different tests. Nor does the diversity in high school subject areas end with 
the names of courses; it extends also to objectives and content. Here 
again, there is a contrast with the situation in elementary grades, where 
objectives and content are probably much more standardized. An extreme 
example of such diversity in secondary school subjects is in the field of liter- 
ature, where it is practically impossible to construct a test whose content 
will satisfy all or a majority of English teachers. In varying degrees, this 
same principle applies in other subject matter areas. 

Makers of tests for secondary grades usually attempt to solve this problem 
in one or both of two ways. The first and more usual practice is to 
base the tests on the common essentials as determined by an analysis of 
leading textbooks and courses of study, putting into the tests only such 
content and skills as are found in all or nearly all such curricular material. 
The second approach is to make a test that measures ability to use the 
knowledge and skills learned, rather than just the knowledge and skills, 
per se. A test of this sort might include measures of a wide variety of objec- 
tives besides knowledge, such as study skills, or the ability to apply or 
interpret knowledge. 

While no sharp line of demarcation can be drawn to identify all current 
standardized tests with one type or the other, the tendency is away from 
the currently predominant first type, and toward the second, which is the 
functional approach. Most modern tests for high school subjects lean in 
this direction, though they nearly always have a definite content orientation 
as well. This combination is certainly not undesirable and is probably 
necessary, for it is difficult to conceive of anyone interpreting and applying 
knowledge which he does not possess. On the other hand, it does seem 
possible to have knowledge without being able to make use of it. Hence, 
both types of objectives are valuable, and an ideal test should measure 
both. 

Time is another factor which complicates the situation for the maker 
and user of standardized tests in high schools. The elementary school 
program is much more flexible in this respect than that of the high school. 
Usually, the elementary teacher remains all day in the same room with the 
same pupils, and if she wishes to devote several hours of one day to a par- 
ticular activity, whether this be testing, classwork, or a field trip, the matter 
of time presents no serious problems. There are no fixed periods and no 
rigid schedules; generally speaking, no other teachers must be consulted 
about a change in procedure for the day. The high school program, on the 
other hand, is nearly always on a fixed schedule of class periods, averaging 
forty-five minutes in length. To detain a given group or class for more 
than one period for testing or other reasons on any given day nearly always 
requires consultation with others, adjustments in schedules, and approval 
of arrangements by several persons. As a result, high school tests some- 
times seem to be constructed to fit a time schedule rather than to adequately 
measure important instructional goals. Most batteries, of course, are 
designed so that they can be given in several sittings, and this arrangement 
makes possible adherence to a school schedule and yet provides reasonably 
adequate measurement in each subject. 

Finally, since there is such a wide variety of individual study patterns 
and subjects in the high school, most tests must include items over a wide 
range of difficulty and content in order to satisfy a majority. To formulate 
a test which is inclusive and which samples an area adequately poses difficult 
problems for those who construct or use standardized tests at the secondary 
level. The situation is reflected by high school teachers in their most frequently 
voiced criticism of standardized tests, namely, that “they don’t 
measure what I teach.”1 

In this chapter, as in the preceding one, the study of the types of achieve- 
ment tests will depend principally on prototypes or examples. There are 
a great many standardized tests for high school subjects, especially in the 
common or core subjects. It would be impossible to describe and discuss 
even a substantial proportion of them in a volume such as this. However, 
lists of tests will be found at the end of each section; these briefly note such 
matters as date of publication, cost, content, publisher, number of forms, 
and time required. 

Survey batteries are discussed in the first section, then tests in the specific 
subject areas of English, social studies, science, and mathematics. The 
annotated bibliography at the end of the chapter lists a number of books 
which include discussions and descriptions of tests in many other fields. 


SURVEY BATTERIES 


The problems that have just been discussed are of particular importance 
to teachers and others who construct survey batteries for use in the high 
school grades. The variety of courses, the fixed schedules, and the wide 
range in content and difficulty all affect the survey battery even more 
than they affect a test in a single subject, and these factors provoke such 
questions as the following: To which fields or subject-matter areas should 
such a battery be confined? How extensively can the battery sample each 
area, and how simple must some of the exercises be in order that every 
pupil may find something at which he can succeed? How wide an area can 
the battery cover before it becomes so thin and superficial that its users 
will have no confidence in it? Above all, how can an adequate job of 
measurement be done in a time short enough to enable schools to use the 
instrument? 

The authors of most high school survey batteries meet these questions 
and situations by confining the tests to the subjects of English, social 
studies, science, and mathematics. When they venture beyond these areas 
of knowledge, it is usually to measure interpretive and applicative abilities 
related to the content areas. Also, test authors usually attempt to include a 
wide enough range of difficulty in each subject to measure achievement of 
those pupils who have had from one to eight or more semesters in that area. 
As a result, the tests are sometimes too difficult for the beginners, or too 
easy for the bright twelfth-graders. 

While conditions and needs at the secondary level vary greatly, they 
have enough in common to make the survey battery useful for certain 
purposes. As a measure of a pupil's basic orientation in the broad fields 
already mentioned, the survey battery is rapid and fairly accurate. Obviously, 
it does not give adequate measurement in any one subject such as 
American history or trigonometry, yet it does reveal inequalities in a 
pupil’s background and achievement among the fields covered, and provides 
a basis of comparison of the achievement of individuals and groups with that 
of other secondary school pupils. The survey battery is also a useful tool 
for the guidance worker seeking to assist pupils in making decisions about 
courses and curriculums. It should be remembered that test batteries differentiate 
only roughly and should always be used with full regard for their 
limitations and in conjunction with other more sensitive measuring instruments. 


1 Victor H. Noll and Walter N. Durost, “Measurement Practices and Preferences of 
High School Teachers,” Test Service Notebook No. 8 (Yonkers-on-Hudson, N.Y.: 
World Book Company). 


Iowa Tests of Educational Development 


The high school survey battery to be described here as an example was 
first published in 1942. In this first edition, the tests were rented and not 
sold, and no separate forms of the individual tests were available. The rental 
at $.75 per test included scoring, tabulating, and analysis of results. 
The tests were revised in 1948 and have since been available for purchase 
both as a complete set in one booklet of fifty-six pages which is scored by the 
publisher, and as separate booklets, one for each of the nine sub-tests. The 
separate booklets are scored manually by the user. 

This survey battery represents an outstanding example of an attempt to 
use the functional rather than the content-centered approach. Other survey 
batteries for high school use, some of equal quality, are listed at the end of 
this section. 


1. Names of tests and authors. Iowa Tests of Educational Development.2 
E. F. Lindquist, General Editor. Test 1: Understanding of Basic Social Concepts, 
J. W. Maucker. Test 2: General Background in the Natural Sciences, 
Robert L. Ebel. Test 3: Correctness and Appropriateness of Expression, John 
Gerber. Test 4: Ability to Do Quantitative Thinking, Paul Blommers. Test 5: 
Interpretation of Reading Materials in the Social Studies, K. W. Vaughn. 
Test 6: Interpretation of Reading Materials in the Natural Sciences, K. W. 
Vaughn. Test 7: Interpretation of Literary Materials, Julia Peterson. Test 8: 
General Vocabulary, K. W. Vaughn. Test 9: Use of Sources of Information, 
K. W. Vaughn. All authors are now or were formerly associated with the State 
University of Iowa. 

2. Nature and purposes. The general manual for the battery states that 
“These tests are not based on an analysis of individual high school subjects. 
Rather, they are designed to measure relatively broad and generalized intellectual 
skills and abilities that are continuously developed in every student 
throughout all of the years that he is in school. This design makes the tests 
appropriate for administration to all high school students in all grades regardless 
of the subjects they are taking. 


2 Quotations in test description by permission of the publisher. 

“The battery as a whole is concerned not so much with what the student 
knows as with what he can do. The student's performance on these tests indi- 
cates not only how much he has accomplished to date, but also how much he 
can profit from further instruction, or how well he is prepared to continue his 
own education. The major purposes of the Iowa Tests are: 

“First, to enable teachers and counselors to keep themselves more intimately and 
reliably acquainted with the educational development of each high school student. 
Such knowledge makes it easier to adapt instruction and guidance to each 
student's peculiar and changing needs. 

“Second, to provide the school administrator with a more dependable and objective 
basis for evaluating the total educational offering of the school. The test 
results point up any need for curriculum revision that may exist. They also 
facilitate a wiser distribution of supervisory efforts.” 


3. Grade level. The manual states that the battery is designed for Grades 
9-12, though the test booklets themselves are marked 9-13. The tests prob- 
ably have enough range to make them usable with college freshmen. 


4. Number of forms. One form, Y-2, except for Test 5, which is now 
available in two forms. 


5. Publisher and date of publication. Science Research Associates, 
1951. 


6. Cost. (a) Each test, 1-9: 
       20 test booklets ..................... $3.00 
       20 self-scoring answer pads ..........  2.00 
       100 machine-scoring answer sheets ....  4.00 
       Complete specimen set ................  3.00 
       Specimen of each test ................   .90 


(b) Single booklet containing all tests and including scoring 
service: prices upon request. 


7. Content. Test 1, Understanding of Basic Social Concepts. Consists of 
90 multiple-choice items, “designed to measure general knowledge and under- 
standing of contemporary social institutions and practices.” 

Sample item: 
What is meant by a “left wing" political party? 
a. A party which has recently been defeated 
b. A party which wants to make rapid changes 
c. A party which has only a few members 
d. A party which makes a lot of noise but has no 
real power 


Actual working time on Test 1, 55 minutes. This does not include time for 
passing out booklets, reading directions, etc. 


Test 2, General Background in the Natural Sciences. Consists of 90 multiple- 
choice items, “designed to measure general knowledge of scientific terms and 
principles, of common natural phenomena and industrial applications, and 
of the place of science in modern civilization.” 
Sample item: 
What determines how much heat is used in changing a solid 
metal ball into liquid metal? 
a. Only the size of the ball 
b. Only the material from which the ball is made 
c. Both the size and the material 
d. The size, the material, and the temperature of 
the applied heat 


Actual working time on Test 2, 60 minutes. 


Test 3, Correctness and Appropriateness of Expression. Part I consists of 
a letter and three prose selections, each containing errors of expression and 
inappropriate expressions. The pupil is to identify these and indicate how 
each may be corrected or improved. There are 88 items in this part. 


Sample item: 
He would not be able to see nothing. 
a. no change 
b. nowhere 
c. anything 


Part II consists of 15 spelling items. 


Sample item: 
a. laboratory 
b. petroleum 
c. advertisement 
d. miscellaneous 
e. none wrong 


Actual working time on Test 3, 60 minutes. 


Test 4, Ability to Do Quantitative Thinking. Consists of 53 multiple-choice 
items designed to measure “general mathematical ability.” It draws on information 
gained in courses other than mathematics, and deals with practical 
quantitative problems which “every high school graduate should be able to 
solve.” 

Sample item: 

A home worth $10,000 is insured by an ordinary fire insurance 
policy for $6,000. It is damaged by fire to the extent of $2,000. 
How much insurance should the company pay the owner? 

a. The amount of the policy, $6,000 
b. Three-fifths of the damage, $1,200 
c. The amount of the damage, $2,000 
d. The value of the home, $10,000 

e. Correct answer not given above 


Actual working time on Test 4, 65 minutes. 


Test 5, Interpretation of Reading Materials in the Social Studies. Consists of 
80 multiple-choice items whose purpose is to measure “ability to interpret and 
evaluate reading selections taken from social studies textbooks and references, 
from magazine and newspaper articles on social problems, and from the literature 
of the social studies in general.” This includes ability to understand what 
is stated and what is implied in a selection, and ability to evaluate it critically. 
Sample item (following a paragraph on social insurance) : 
With what social issue is this paragraph concerned? 
a. Should there be government ownership of life insurance 
companies? 
b. Should the government make life insurance compulsory? 
c. Should the government be responsible for the economic 
security of its citizens? 
d. Should the government make it easier to effect reforms? 


Actual working time on Test 5, 60 minutes. 


Test 6, Interpretation of Reading Materials in the Natural Sciences. “This 
test measures ability to interpret and evaluate reading materials selected from 
textbooks and references used in the natural sciences, from scientific articles in 
newspapers and periodicals, and from relatively non-technical or popular and 
semi-popular scientific literature in general.” This includes the “ability to 
understand what is stated in a selection,” and "ability to evaluate a selection 
critically." The test consists of 81 multiple-choice items based on ten selections. 


Sample item (following a selection of approximately 450 words dealing 
with mimicry in animals and plants): 


Which of the following may we infer is especially favorable to 
the development of mimetic resemblances in insects? 
a. The effect of similar functions 
b. Bright coloration 
c. Small size 
d. Short life 


Actual working time on Test 6, 60 minutes. 


Test 7, Interpretation of Literary Materials. This test is intended to 
cover “most of the measurable understandings that high school students 
derive from the reading of literary materials.” Included are such elements 
as understanding of detail, comprehension of characterization, recognition of 
mood, tone, emotion, writer’s purpose or viewpoint, imagery, figures of speech, 
grasp of main thought(s) of a passage, and awareness of outstanding qualities 
of style or structure. The test does not purport to measure appreciation which 
is subjective and varies from reader to reader. 

The test consists of seven prose selections and four of poetry, on which 80 
multiple-choice items are based. 


Sample item: 


Loveliest of trees, the cherry now 

Is hung with bloom along the bough, 
And stands about the woodland ride 
Wearing white for Eastertide. 


What feeling does the poet express in this passage? 
a. Delight in beauty 
b. Religious faith 
c. Fear of death 
d. Enjoyment of old age 


Actual working time on Test 7, 50 minutes. 


Test 8, General Vocabulary. The test is intended to measure (1) general aptitude 
for school work and (2) recognition of word meaning. Eighty per 
cent of the words were selected from the Thorndike Century Senior Dictionary 
basic list of 20,000 words. The remainder are in this dictionary but not in the 
list. The test consists of 75 multiple-choice items of which the following is a 
sample: 

Its ornate exterior 
a. formal 
b. highly decorated 
c. plain 
d. bleak 


Actual working time on Test 8, 22 minutes. 


Test 9, Use of Sources of Information. “This test measures the student's 
familiarity with important sources of information and his ability to use them. 
Specific skills measured include (1) knowledge of the nature and purposes of the 
major types of sources of information; (2) knowledge of the specific contents 
of the more common sources such as dictionaries, encyclopedias, and year- 
books; (3) ability to select the source most appropriate to use in a given situa- 
tion; (4) ability to interpret bibliographical references; (5) ability to use a card 
index efficiently.” 

The test consists of 65 multiple-choice items, such as the following: 


Where would you look to find the location of Orange County, 
Indiana? 
a. In an encyclopedia. 
b. In an atlas. 
c. On a globe. 
d. In a gazetteer. 


Actual working time on Test 9, 27 minutes. 


8. Time required. See above. 


9. Directions for, and ease of, administering. The tests are easily ad- 
ministered. All are preceded by a sample item marked on the answer sheet, 
so that the pupil can see one question answered and marked correctly. The 
directions for administering each test are clear and complete, and include all 
necessary details regarding room requirements, materials needed, use of answer 
sheets, proctors, and timing. There are no sub-parts requiring separate timing 
in any test. 


10. Validity. A large number of items were prepared for each test and, on 
the basis of preliminary tryouts on 3,500 pupils in eight Iowa high schools, 
discriminating power and difficulty of each item were determined. 

Correlations between scholarship average in the first year of college and 
four measures, including composite score on the ITED, were calculated for 
282 Iowa high school seniors who took the tests in 1946 and the next fall entered 
one of the three state institutions for higher education in Iowa. The 
correlations were as follows: 


High school grade point average with college freshmen average .................... .61 
Percentile rank in high school graduating class with college freshmen average ... .58 
Composite score on ITED at entrance to 9th grade with college freshmen average .. .91 
Composite score on ITED at entrance to 12th grade with college freshmen average . .62 


Thus it appears that the results of the ITED are as good predictors of success 
in first-year college, as measured by marks, as the record of four years in high 
school. 

The average of the correlations between each of the nine tests and each of 
the other eight, and the composite, is .71. This seems to suggest that the nine 
tests are measuring substantially the same things, though the manual states 
that this coefficient "constitutes objective and conclusive evidence that the 
various tests in the Iowa battery really do measure different things.” 
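
The two readings of the .71 figure can be set side by side with a little arithmetic. The sketch below is illustrative only and does not come from the manual. It assumes, for the sake of the example, that the average coefficient can be treated as a single correlation between two tests, and squares it to estimate the proportion of variance the two tests would share.

```python
# Assumption: the reported average intercorrelation is treated as one Pearson r.
average_r = 0.71

# The squared correlation estimates the proportion of variance in one test
# that is linearly associated with the other.
shared_variance = average_r ** 2
unshared_variance = 1 - shared_variance

print(f"shared: {shared_variance:.2f}, not shared: {unshared_variance:.2f}")
```

Under this assumption about half of the variance is common to any two tests and about half is not, which may explain how the same coefficient can be cited in support of either interpretation.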

Finally, it is stated that the ultimate test of validity is whether or not the 
tests measure what the prospective user considers to be desirable outcomes 
of a program of general education. It is suggested that he take the tests and 
then decide for himself whether they are valid or not. In a sense, this implies 
either that the majority, at least, would have substantially the same ideas of 
desirable outcomes, or that the tests can be all things to all men. The former 
seems clearly the only practical alternative. 


11. Reliability. Reliability coefficients based on correlations between 
split-halves (odd-even items) and corrected by the Spearman-Brown Formula are 
reported in the form of the average coefficients for separate grade groups within 
each school. This tends to restrict the variability or range and lower the coefficients. 
Nevertheless, all the average reliability coefficients are between .81 
and .94. Most are in the neighborhood of .88 to .92, which is quite satisfactory. 
The probable error of measurement of a single standard score is reported 
as one point. 
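
The odd-even procedure described above can be sketched in a few lines. This is a minimal illustration and not the publisher's computation: the function names and the small table of item scores are hypothetical, and the standard Spearman-Brown formula for a test of doubled length, r_full = 2r / (1 + r), is assumed to be the correction meant.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def spearman_brown(r_half):
    """Estimate full-length reliability from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """Correlate odd-item and even-item half scores, then apply the correction."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))


# Hypothetical data: five pupils, six items each (1 = right, 0 = wrong).
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
]
reliability = split_half_reliability(scores)
print(round(reliability, 2))
```

Because the two halves are each only half as long as the whole test, their raw correlation understates the reliability of the full test; the correction step adjusts for that.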


12. Manuals. There is a general manual for the entire battery giving in- 
formation on development and standardization, and other important facts 
about the tests. A separate manual for each test outlines the purposes of the 
test and provides sample items and selected references, together with directions 
for administering and scoring the tests and for interpreting and using the re- 
sults. All manuals are clearly written and complete. 


13. Scoring. The tests may be scored manually or by machine. Manual 
scoring is made easy by use of carbon answer pads which eliminate the need for 
scoring keys, though they add to the cost. All items are in multiple-choice 
form and scoring is entirely objective. There is no correction for chance, the 
score on each test being the number right. 

Scores on each of the nine tests can be converted into standard scores based 
on the same standard scale and the scores on the different tests are said to be 
“highly comparable.” Moreover, the standard scores are said to be comparable 
from year to year, that is, from grade to grade. The sum of the scores on Tests 
1 to 8, inclusive, may also be converted to a composite standard score which is 
based on the same standard scale as that for the separate tests. 


14. Norms. What has been said above in relation to scores also has a bear- 
ing on norms since the standard scores are, in a sense, norms of achievement. 
Percentile norms, based on school average standard scores and on composite 
standard scores, are given for each test. A “Confidential Summary Report” 
booklet which must be ordered or requested separately interprets these norms 
and gives suggestions for their use. The norms are based on approximately 
50,000 pupils in 290 high schools, most of them in Iowa. It is claimed, however, 
that these are representative of the national high school population on the 
basis of data collected by giving Tests 3-7 to scientifically selected, nation-wide 
samples of 30,000 graduating seniors in 1943-44. The results paralleled quite 
closely those based on graduating seniors from Iowa high schools. 


15. Format. The tests are well printed on good paper. Manuals, tests, 
and other accessories are well organized and arranged. The tests with self- 
scoring answer pads constitute a useful and practical survey battery for the 
secondary level, of the type described as primarily functionally oriented. It is 
considerably more expensive to give the nine tests of this battery than some 
other batteries. The tests cost $.15 per copy; the carbon answer pads, $.10 
per copy. With shipping costs added, this comes to more than $.25 per pupil 
per test. Thus, the battery costs over $2.25 per pupil if the separate test 
booklet edition is used. This is more than most schools will feel they can afford 
to spend on this phase of the testing program. However, the tests have 
many excellent features and advantages which make them worth some additional 
cost. 
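
The per-pupil figures quoted above follow from the price list in item 6. The short sketch below simply retraces that arithmetic; the variable names are illustrative, and shipping is left out because no shipping rate is given.

```python
from decimal import Decimal

# Prices from the cost table: packages of 20 booklets and 20 answer pads.
BOOKLETS_PER_20 = Decimal("3.00")
ANSWER_PADS_PER_20 = Decimal("2.00")
NUMBER_OF_TESTS = 9

booklet_each = BOOKLETS_PER_20 / 20        # $.15 per copy
answer_pad_each = ANSWER_PADS_PER_20 / 20  # $.10 per copy

per_pupil_per_test = booklet_each + answer_pad_each
per_pupil_battery = per_pupil_per_test * NUMBER_OF_TESTS

print(f"per test: ${per_pupil_per_test:.2f}; nine tests: ${per_pupil_battery:.2f}")
```

Before shipping, this gives $.25 per pupil per test and $2.25 per pupil for the nine tests, which is why the text describes the totals as "more than" those amounts once shipping is added.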


Other Survey Batteries 
1. California (formerly Progressive) Achievement Tests. Grades 9-14. 
Advanced Battery of the series, 1950. Reading Vocabulary, Reading Com- 
prehension, Arithmetical Reasoning, Arithmetic Fundamentals, Mechanics of 
English and Grammar, Spelling. Forms AA, BB, CC, $.14 per copy, singly or 
in packages of 35; $.50 per specimen set; 150 (165) minutes. 
California Test Bureau. 


2. Cooperative General Achievement Tests, Revised Series. Grades 12-13. 
1947-51. Separate tests of Social Studies, Natural Sciences, and Mathe- 
matics; Forms XX and YZ, $2.75 per 25; answer sheets, $.90 per 25; $.50 per 
specimen set. 40 (45) minutes. 

Educational Testing Service. 


3. Essential High School Content Battery. Grades 9-13. 1950. Mathematics, 
Science, Social Studies, and Language and Literature. Forms Am, Bm, $8.40 
per 35; answer sheets, $2.25 per 35; $.50 per specimen set. 205 (225) minutes. 

World Book Company. 


4. Ohio General Scholarship Test for High School Seniors. Grade 12. New 
form annually. English, History, Mathematics, Science, Reading; $.10 per 
copy or $.60 per pupil, including scoring, analysis, and reporting of results. 
Ohio State Department of Education. 


5. Tests of General Educational Development. 1944. Grade 12 and Adults. 
Correctness and Effectiveness of Expression, Interpretation of Reading Ma- 
terials in the Social Studies, Interpretation of Reading Materials in the Natural 
Sciences, Interpretation of Literary Materials, General Mathematical Ability, 
each in a separate booklet; one form, $2.50 per 25; $.50 per specimen set. 120 
minutes per test. Separate answer sheets must be used. 

American Council on Education. 


ACHIEVEMENT TESTS IN SPECIFIC SUBJECTS 


English 


Many standardized tests in this subject are available for junior and sen- 
ior high school grades. There are tests in the fundamentals of English 
grammar, sentence structure, capitalization, and punctuation; tests to 
measure knowledge and appreciation of literature; scales to measure compo- 
sition or writing ability; and tests of vocabulary and spelling ability. As 
before, only one test will be described in detail as an example in its field 
and others will be listed for reference. This list is not complete or exhaus- 
tive, but it includes representative examples of the variety of tests avail- 
able in the broad field of English. 


Cooperative English Test 


1. Name of test and authors. Cooperative English Test.2 Authors: Janet 
Afflerbach and others. 


2. Nature and purposes. A comprehensive test of fundamental areas of 
competence in English; namely, Mechanics of Expression, Effectiveness of 
Expression, and Reading Comprehension. The three tests are available in 
separate booklets so that any one or two, or all three, can be given as desired; 
there is also a single-booklet edition containing the three tests. Only the first 
two will be described here. 


3. Grade level. Lower Level, Grades 7-12. Higher Level, superior stu- 
dents in Grades 11 and 12, and college students. 


4. Number of forms. Forms T, Y, RX, and Z of the single-booklet edi- 
tion. Form T, X, Y, and Z of Test A, Mechanics of Expression, and Test B, 
Effectiveness of Expression. 


5. Publisher and date of publication. Cooperative Test Division of Ed- 
ucational Testing Service, 1953. 


6. Cost. Single booklet edition, $4.95 per 25. Answer sheets, $1.90 per 25. 


2 From Cooperative English Test, by Janet Afflerbach and 
others. Copyright by Educational Testing Service. Reprinted by special permission 
of Educational Testing Service. 
Test A, Mechanics of Expression, $2.25 per 25. Answer sheets, $.90 per 25. 
Test B, Effectiveness of Expression, $2.50 per 25. Answer sheets, $.90 per 25. 


7. Content. Test A, Mechanics of Expression, Part I, Grammatical Usage, 
contains 60 sentences with four parts of each one underlined. The student is 
to decide if there is an error in usage in any of the underlined parts and, if so, 
fill in the corresponding number on the answer sheet. If there is no error, that is 
to be indicated also by filling in the space under “0,” as shown in the samples. 


I. He says that he ain't coming home with us today. 
1 2 3 4 


II. She isn't ready to go home. 
1 2 3 4 


Part II, Punctuation and Capitalization. This consists of nine passages 
which the student is to punctuate, as follows: 
We came home yesterday 
I: (1) N (2) , (3) ; 
II: (1) N (2) . (3) ? 

Then follow three passages in which capitals are to be indicated where ap- 
propriate. 


Sample: 
His name is henry. I C s 
I II H 


Il 


Part III, Spelling, consists of 30 groups of words, four words per group, in
each of which one word or none is misspelled. The task is to identify the mis-
spelled word or indicate that none is incorrectly spelled.


Sample: 1-1 measurement 
1-2 hesitate 
1-3 pursuit 
1-4 aviater 
1-5 none wrong 


Test B, Effectiveness of Expression, Part I, Sentence Structure and Style, con- 
sists of paired passages from which the better version is to be chosen, and four 
sets of four sentences in each of which the most satisfactorily expressed sen- 
tence is to be identified. 


Sample:
A-1 A Kansas City boy learned a rule that it is always best for one to go to
the highest authority whenever a problem arises in his high school civics class.

A-2 One of the rules a Kansas City boy learned in his high-school civics class
was that it is always best to go to the highest authority with a problem.


Part II, Diction, consists of 20 multiple-choice items to test choice of words. 
In each case the word that would most suitably complete the sentence is to be 
indicated. 


Sample: 
In what year was the Mississippi River (______)?
1-1 discovered 
1-2 invented 
1-3 encountered 
1-4 originated 
1-5 detected 


Part III, Organization, consists of (a) groups of sentences, three to five in 
a group, which are to be arranged in the best order, and (b) an outline of a 
process from which parts have been omitted. 


Samples: 
A. One must be everlastingly at it. 
B. It is not easy to write well. 
C. Practice is most important. 


1. If the three sentences above were re-arranged in the best order,
sentence A would be placed
    1. first
    2. directly after B
    3. directly after C
2. Sentence B would be placed
    1. first
    2. directly after A
    3. directly after C
3. Sentence C would be placed
    1. first
    2. directly after A
    3. directly after B


How to Make Jelly


I. Preliminary steps 
A. (-23-) 
1. Fruit which jells easily 
2. Fruit not at peak ripeness 


23. In filling in the incomplete outline above, which one of the 
following should you use for A under heading I? 
23-1 Varieties of fruit available 
23-2 Selecting fruit
23-3 Amount of fruit to use 
23-4 Equipment needed 
23-5 Nutritional values 
8. Time required.
Mechanics of Expression:
Part I, Grammatical Usage, 15 minutes
Part II, Punctuation and Capitalization, 15 minutes
Part III, Spelling, 10 minutes


Effectiveness of Expression: 

Part I, Sentence Structure and Style, 15 minutes 
Part II, Diction, 10 minutes 

Part III, Organization, 15 minutes 


9. Directions for, and ease of, administering. Directions are concise and 
clear. Tests are simple and easy to administer, with a minimum of participa- 
tion by the examiner. 


10. Validity. Validity of the test appears to rest solely on the competence 
of the authors, their knowledge of the fundamentals of English, and their 
ability to construct test items to measure these fundamentals. In general, the 
Cooperative Tests are constructed by carefully chosen specialists who presum-
ably know the literature of their respective fields and who are skilled test
technicians. The average user of the tests would have a slightly more com- 
fortable feeling about them if the producers would submit evidence on some of 
the conventional criteria of validity such as correlations with other measures, 
validity indexes, etc. There is some statistical evidence of validity gathered 
by users of the tests subsequent to publication in the form of correlations with 
marks in English courses in college. 

11. Reliability. Coefficients of correlation between a regular form (Y
and Z) of the test and a parallel short form given to the same students range
from .86 to .92 for Tests A and B. Standard errors of measurement range from
2.5 to 3.9.

12. Manual. There are no separate manuals for the individual Cooperative
Tests, nor any booklets summarizing data available on all of them. Adequate
directions for administering and scoring each test are part of the test itself, and
of the scoring key.

13. Scoring. Separate answer sheets are required for Form Z; their use is 
optional with other forms. The tests are scored with a fan or strip key by 
hand; answer sheets are scored with a scoring stencil by hand, or may also be 
scored by machine. 

14. Norms. Scaled scores, essentially standard scores with a mean of 50 
and a standard deviation of 10, are provided. Percentile rank tables for each 
grade level are also available. 

15. Format. The Cooperative Tests are generally among the best in format
and printing. They are printed on high-grade paper, with ample spacing, good 
typography, and careful attention to detail. 


Other Tests in English 
1. Barrett-Ryan-Schrammel English Test. New edition, 1956. Grades 9-13. 
Functional Grammar, Punctuation, Parts of Speech, Parts of a Sentence, Sen-
tence Elements, Vocabulary, and Pronunciation.

Forms Om and Em. $3.25 per 35. 60 minutes. 

World Book Company. 


2. California Language Test. Sub-test of California Achievement Test. 1951. 
Intermediate, Grades 7-9; Advanced, Grades 9-14. Mechanics of English and
Grammar, and Spelling. 

Forms AA, BB, CC, and DD of Intermediate. $.06 per copy. 27 minutes. 

Forms AA, BB, and CC of Advanced. $.07 per copy. 30 minutes. 

California Test Bureau. 


3. Center-Durost Literature Acquaintance Test. 1951. Grades 11-13. Tests 
recognition of excerpts from selections in the recommended Reading List of the 
National Council of Teachers of English. 

Form Am. $3.40 per 35. Answer sheets, $1.35 per 35. 40 minutes. 

World Book Company. 


4. Cooperative Dictionary Test. 1951-52. Grades 7-12. Tests ability to use 
a dictionary in finding various kinds of information quickly and accurately. 
Also tests knowledge of alphabetization, pronunciation, derivation, spelling, 
and meaning of words. 

Form A. $1.75 per 25. Answer sheets, $1.35 per 25. 30 minutes. 

Educational Testing Service. 


5. Cooperative English Test. (Usage, Spelling, Vocabulary.) 1950. Grades 
7-12, and college. Grammar, punctuation, capitalization, sentence structure, 
spelling, and word knowledge are tested. 

Forms OM and PM. $2.95 per 25. Answer sheets, $1.00 per 25. 70 minutes. 

Educational Testing Service. 


6. Cooperative Literary Comprehension and Appreciation Test. 1935-51. 
Grades 10-12, and college. Measures grasp of content, perception of author’s 
viewpoint, recognition of literary devices, and appreciation of style and rhythm 
in poetry and prose selections.

Forms R and T. $2.25 per 25. Answer sheets, $.90 per 25. 40 minutes. 

Educational Testing Service.


7. Davis-Roahen-Schrammel American Literature Test. 1938. Grades 9-12,
and college. Based on standard selections by American authors. Covers
content, authorship, recognition, and understanding of quotations, and appreci- 
ation of literary values. 

Forms A and B. $1.20 per 25. 60 minutes. 

Bureau of Educational Measurements. 


8. Essentials of English Tests. 1939. Grades 7-13. Spelling, Grammatical 
Usage, Word Usage, Sentence Structure, Punctuation, and Capitalization. 

Forms A, B, and C. $1.95 per 25. 45 minutes. 

Educational Test Bureau. 


9. Greene-Stapp Language Abilities Test. 1951. Grades 9-13. Capitaliza- 
tion, Spelling, Punctuation, Sentence Structure and Applied Grammar, and 
Usage and Applied Grammar. 
Forms Am and Bm. $6.00 per 35. Answer sheets, $1.65 per 35. Two
hours.
World Book Company. 


10. Hudelson Typical Composition Ability Scale. 1923. Grades 4-12. A
graded series of compositions varying by equal degrees of merit from excellent
to very poor. The quality of the individual pupil's composition is determined 
by comparing it with the scale. 

One form. $1.75 per 25; only one needed per class. $.08 per copy. 15 min- 
utes. 

Public School Publishing Company. 

11. Iowa Language Abilities Test. 1946. Intermediate, Grades 7-10. Spell-
ing, Vocabulary, Usage, Capitalization, Punctuation, Sentence Sense, and
Grammatical Form.

Forms A and B. $4.30 per 35. Answer sheets, $2.25 per 35. Separate 
answer sheet edition, $5.60. 

World Book Company. 

12. New Purdue Placement Test in English. 1954. Upper high school years
and college. Recognition of Grammatical Errors, Punctuation, Sentence Clear-
ness and Effectiveness, Reading (Study), Reading (Pleasure), Vocabulary,
Spelling.

Forms D and E. $4.50 per package of 35 tests and 35 answer booklets. 
$2.10 per 35 answer booklets. 65 minutes. 

Houghton Mifflin Company. 

13. Test of English Usage. 1950. High school and college. Capitalization, 
Use of Apostrophe, and Punctuation; Word Usage, Building Sentences and
Paragraphs.

Forms A and B. $.10 per copy. 100 minutes. 
California Test Bureau. 


Social Studies 


The field of social studies at the secondary level generally includes civics, 
ancient history, modern European history, world history, U.S. history, and 
a course often called problems of democracy or American government. 
Standardized tests are available in each of these subjects, and most leading 
publishers have offerings in at least several. One standardized test in
American history will be described in detail, and others in the various sub- 
jects will be listed for reference. 


Crary American History Test 
1. Name of test and authors. Crary American History Test, Ryland W. 
Crary, Teachers College, Columbia University. 


2. Nature and purposes. The test is intended to measure the achieve-
ment of pupils with respect to the important objectives of a high school course


4 Quotations in test description by permission of the publisher. 
in American history. The objectives and the number and proportion of items
emphasizing each are as follows: 


Factual Information, 28 items (31%)
    Important dates .............. 5
    Important laws ............... 4
    Important ideas .............. 5
    Important treaties ........... 5
    Advancement in democracy ..... 5
    Advancement in science ....... 4

Skills, 16 items (18%)
    Sources of information ....... 4
    Time relationships ........... 4
    Map skills ................... 8

Interpretation of Historical Information, 8 items (9%)
Understanding of Historical Processes, 26 items (29%)
Reasoned Inferences, 12 items (13%)
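
The proportions quoted for the five objectives can be checked by simple arithmetic. The following is a minimal sketch, not part of the test manual: it takes the five item counts given above, assumes they account for the whole test, and expresses each as a rounded percentage of the total. The variable and function names are illustrative only.

```python
# Item counts per objective, as listed in the description above.
item_counts = {
    "Factual Information": 28,
    "Skills": 16,
    "Interpretation of Historical Information": 8,
    "Understanding of Historical Processes": 26,
    "Reasoned Inferences": 12,
}

# Assumption: the five objectives together make up the whole test.
total_items = sum(item_counts.values())


def percentage_of_test(count, total):
    """Return an objective's share of the test, rounded to a whole per cent."""
    return round(100 * count / total)


shares = {name: percentage_of_test(n, total_items) for name, n in item_counts.items()}

for name, pct in shares.items():
    print(f"{name}: {item_counts[name]} items ({pct}%)")
```

Under that assumption the counts total 90 items, and the rounded shares agree with the percentages quoted in the description.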


3. Grade level. No particular grade level is specified, but it is intended to 
be given at the end of a year's course in American history in the high school. 

4. Number of forms. Am and Bm. 

5. Publisher and date of publication. World Book Company, 1951. 

6. Cost. $3.40 per 35. Separate answer sheets required. $1.35 per 35. 


7. Content. The first 50 items in the test consist of sets of matching 
questions. These deal with dates of important events, sources of information,
methods and processes in democratic government, accomplishment of im- 
portant treaties and constitutional amendments, etc. 


Sample: 
Column I                          Column II
a. developments in radio          1. Edison
b. atomic energy research         2. Marconi
c. combine thresher               3. Urey
d. incandescent lighting          4. Fleming
e. mass production
f. wireless telegraphy
g. radar development
h. penicillin research


Although the test is not divided into parts, the second section consists of 
multiple-choice items with a few true-false items based on a reading selection. 
Part of the multiple-choice items are based on an outline map of the United 
States, and test for knowledge of important cities and localities identified by 
what makes them important rather than by their names. 


Sample: 
The chief automobile manufacturing center of America is rep- 
resented on the map by 
a. 1 b. 4 c. 6 d. 11
e. none of the above 
The true-false questions represent conclusions which are correct or incorrect
according to statements in the passage read. These conclusions are not all 
factual, some being in the nature of inferences. The next 21 items are in 
multiple-choice form and bear on the entire range of objectives for the test, 
but they emphasize thought and understanding rather than knowledge alone. 
The last three items are a variation of multiple-choice in which the question is 
followed by five or six choices, which in turn are followed by five answers, each 
representing a different combination of the choices. 


Sample: 
What conditions contributed to the economic depression of 
the early 1930s? 


1. The lack of farm prosperity in the 1920's
2. The decline of foreign markets after World War I
3. The lack of purchasing power of low-income groups
4. The large military budgets of the 1920's
5. The lack of industrial capacity and natural resources
a. 1, 2, 3
b. 1, 2, 4
c. 2, 3, 5
d. 1, 4, 5
e. all of the above


8. Time required. 40 minutes. 


9. Directions for, and ease of, administration. Directions for adminis- 
tering are very brief but complete; the test is practically self-administering. 


10. Validity. Evidence for the validity of the Crary American History Test
rests upon two bases: 


a. Analysis of textbooks, courses of study, and pronouncements 
of national committees for the social studies 

b. Statistical analysis of items in the preliminary forms after 
tryout 

Step a was employed to determine objectives and emphases, and the weights 
to be assigned them in the test and in the content of the items. 

Step b yielded difficulty values and discrimination indexes for each item tried 
out. In terms of the percentage of pupils passing each item, the mean difficulty 
value for the items in Form Am was 48 per cent. The mean discrimination 
index was .46. No data on these criteria are given in the manual for Form Bm.
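
The two item statistics named in step b can be illustrated as follows. This is a sketch under stated assumptions, not the publisher's procedure: the manual, as summarized here, does not say which discrimination index was used, so the example substitutes a common one, the difference between the proportions passing in the highest-scoring and lowest-scoring groups. The 27 per cent group size, the function names, and the ten pupils' data are all hypothetical.

```python
def item_difficulty(responses):
    """Proportion of pupils passing the item (1 = right, 0 = wrong)."""
    return sum(responses) / len(responses)


def upper_lower_discrimination(total_scores, item_responses, fraction=0.27):
    """Proportion passing in the top group minus proportion passing in the bottom group.

    Pupils are ranked by total test score; the top and bottom `fraction`
    of the ranking form the two comparison groups.
    """
    ranked = sorted(zip(total_scores, item_responses), key=lambda pair: pair[0])
    group_size = max(1, round(len(ranked) * fraction))
    lower = [response for _, response in ranked[:group_size]]
    upper = [response for _, response in ranked[-group_size:]]
    return sum(upper) / group_size - sum(lower) / group_size


# Hypothetical data for ten pupils: total scores and responses to one item.
total_scores = [12, 15, 20, 22, 25, 30, 33, 35, 40, 44]
item_responses = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

difficulty = item_difficulty(item_responses)
discrimination = upper_lower_discrimination(total_scores, item_responses)
print(f"difficulty = {difficulty:.2f}, discrimination = {discrimination:.2f}")
```

An item passed by about half the pupils and passed more often by high scorers than by low scorers, as in this invented example, is the kind of item such a tryout is designed to retain.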


11. Reliability. Corrected split-half reliability coefficients of .87 and .91
are reported, though it is not stated to which form or forms these apply. The 
standard error of measurement for the test is 4.0 standard score points. 
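
The relation between a reliability coefficient and the standard error of measurement can be sketched with the conventional formula, standard error = standard deviation × √(1 − reliability). The example below is an illustration only. It assumes that the standard deviation of 12.5 standard score points given for this series under Scoring applies, and it is not a statement of how the publisher computed the reported figure.

```python
import math


def standard_error_of_measurement(standard_deviation, reliability):
    """Conventional formula: SD times the square root of (1 - reliability)."""
    return standard_deviation * math.sqrt(1.0 - reliability)


SCALE_SD = 12.5  # assumed: standard deviation of the series' standard scores

sem_low_r = standard_error_of_measurement(SCALE_SD, 0.87)
sem_high_r = standard_error_of_measurement(SCALE_SD, 0.91)
print(f"r = .87: SEM = {sem_low_r:.2f}")
print(f"r = .91: SEM = {sem_high_r:.2f}")
```

On these assumptions the two reported coefficients give standard errors of about 4.5 and 3.75, so the reported value of 4.0 lies between them.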


12. Manual. The manual is brief and seems more like a preliminary edition 
than a final or complete one. Necessary information for using the test is in- 
cluded, but subsequent editions will probably go into more detail. 


13. Scoring. The test is easily and quickly scored, either by use of a hand- 
scoring stencil or by a test-scoring machine. No spaces are provided on the test 
for marking answers, so the scoring is always done on printed answer sheets. 
Raw scores are converted into normalized standard scores having a mean of 104
and a standard deviation of 12.5. These standard scores are as comparable as 
any type of score generally used with standardized tests. All scores on tests 
in the Evaluation and Adjustment Series, of which the Crary test is one, are ex- 
pressed in these normalized standard scores. This makes it possible to compare 
an individual pupil’s score on different tests in the series and with his ability 
level as measured by the Terman-McNemar Test of Mental Ability. 
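
One common way of arriving at a normalized standard score is to find a pupil's percentile rank in the norm group, look up the corresponding normal deviate, and place it on the desired scale. The sketch below follows that route with the mean of 104 and standard deviation of 12.5 stated above. The publisher's actual conversion tables are not reproduced here, so the procedure and the function name are assumptions made for illustration.

```python
from statistics import NormalDist

SERIES_MEAN = 104.0  # mean of the standard-score scale, as stated in the text
SERIES_SD = 12.5     # standard deviation of the scale, as stated in the text


def normalized_standard_score(percentile_rank):
    """Convert a percentile rank (strictly between 0 and 100) to the scale.

    The percentile rank is turned into a normal deviate (z), which is then
    rescaled to the stated mean and standard deviation.
    """
    z = NormalDist().inv_cdf(percentile_rank / 100.0)
    return SERIES_MEAN + SERIES_SD * z


for percentile_rank in (16, 50, 84):
    score = normalized_standard_score(percentile_rank)
    print(f"percentile rank {percentile_rank}: standard score {score:.1f}")
```

Because every test in the series is expressed on the same scale, a pupil at the median of the norm group receives a score of 104 on each, which is what makes scores comparable from one test to another.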


14. Norms. Percentile norms are provided for end-of-year administration of 
the test. They are based upon the scores of 6,178 pupils in 55 schools in 21 
states.


15. Format. The test manual and other materials are excellent in organ- 
ization, composition, and quality. The manual is printed on rather cheap 
paper, due, perhaps, to its temporary nature as suggested above. 


Other Tests in Social Studies 


1. American Civics and Government Test. 1949. High school and college.
National, state, and local government. 

Forms A and B. $1.25 per 25. 40 minutes. 

Public School Publishing Company. 


2. California Tests in Social and Related Sciences. 1946-53. Advanced, 
Grades 9-12. 1. Creating a New Nation (to 1789). 2. Nationalism, Section- 
alism, and Conflict (1790-1876). 3. Emergence of Modern America (1877— 
1918). 4. The United States in Transition (since 1918). 

Forms AA and BB. 1 and 2 in a booklet, $.08 per copy; 3 and 4 in a booklet, 
$.08 per copy. 1 and 2, 45 minutes; 3 and 4, 45 minutes. 

California Test Bureau. 


3. Cooperative American History Test. 1947-49. High school and college.
Basic facts and trends in the economic, social, and political development of the 
United States. Approximately one-half the items cover the period since 1865. 

Forms X, Y, and Z. $2.75 per 25. 40 minutes. 

Educational Testing Service.


4. Cooperative Ancient History Test. 1938-39. High school grades. Funda-
mentals underlying broad understanding of ancient backgrounds of our civiliza-
tion.

Forms O and P. $2.50 per 25. 40 minutes. 

Educational Testing Service. 


5. Cooperative Modern European History Test. 1947-48. High school and
college. Historical development from the Middle Ages to the present. 

Forms X and Y. $2.75 per 25. 40 minutes. 

Educational Testing Service. 


6. Cooperative Social Studies Test. 1948. Grades 7-9. American and World
History, Geography, and Civics. 

Forms X and Y. $2.95 per 25. 80 minutes. 

Educational Testing Service. 
7. Cooperative Test in American Government. 1947. Upper high school
grades. Organization and background of American government. 

Forms X and Y. $2.75 per 25. 40 minutes. 

Educational Testing Service. 


8. Cooperative Test of General Proficiency in the Field of Social Studies. 1950. 
Grades 12-13. Part I, social studies concepts; Part II, interpretation of maps, 
graphs, and reading selections. Emphasizes contemporary economic, social, 
and political life. 

Forms XX and YZ. $2.75 per 25. Answer sheets, $.90 per 25. 40 minutes.

Educational Testing Service. 


9. Cooperative World History Test. 1947-49. High school grades. Major 
political, social, and economic trends from prehistoric times to the present. 

Forms X, Y, and Z. $2.75 per 25. 40 minutes. 

Educational Testing Service. 


10. Cummings World History Test. 1950. High school end-of-course test. 
Major historical events, dates, places, and leaders. History of ancient times, 
medieval and modern Europe, and the world wars. 

Forms Am and Bm. $3.90 per 35. Answer sheets, $1.35 per 35. 40 minutes. 

World Book Company. 


11. Dimond-Pflieger Problems of Democracy Test. 1952. High school end-of-
course test. Government, economics, sociology, and international affairs.

Forms Am and Bm. $3.40 per 35. Answer sheets, $1.35 per 35. 40 minutes.

World Book Company.


12. Hills Economics Test. 1940. High school and college. Basic facts, 
principles, and theories of economics. 

One form. $1.20 per 25. 40 minutes. 

Bureau of Educational Measurements. 


13. Modern Geography and Allied Social Studies. 1949. Grades 6-9. 
Trade Routes and Their Products; “Causal Geography,” The United States;
“Causal Geography,” The World; Miscellaneous Geographical Facts; Inven-
tion, Power, Transportation, Communication; Geographical Vocabulary;
World Products, Their Sources and Uses; Economic and Human Relations; 
Place Geography, The United States and the Western Hemisphere; Place 
Geography, Europe and the Eastern Hemisphere. 

Forms A, B, C. $3.30 per 25. 90 minutes. 

C. A. Gregory Company.

14. National Achievement Tests: American History, Government, Problems of
Democracy. 1942, 1953. Grades 9-12. Growth of a national spirit and de-
mocracy: the constitution, foreign policy, and problems of American democ-
racy.

Forms A and B. $2.50 per 25. 40 minutes.

Acorn Publishing Company.


15. National Achievement Tests: Social Studies. 1945. Grades 7-9. Human 
relations, life situations, social interpretations, values of products, social ideas, 
and miscellaneous facts. 
Forms A and B. $2.50 per 25. 35 minutes.

Acorn Publishing Company. 

16. National Achievement Tests: World History. 1948. High school and 
college. Social studies terms, world geography, contributions of world peoples 
to civilization; political history; economic, social, and cultural history. 

Forms A and B. $2.50 per 25. 40 minutes. 

Acorn Publishing Company. 

17. Understanding American History. 1940. Grades 8-12. Character 
Judgment, Historical Vocabulary, Sequence of Events, and Cause-and-Effect 
Relationships. 

One form. $1.50 per 25. 25 minutes. 

Public School Publishing Company. 


Natural Science 


The commonly taught subjects in high schools in the natural science area 
are general science, biological science, chemistry, and physics. Limitations 
of space prohibit the description of a test in each of these subjects, so again, 
one will be presented which is typical of those in its field. 


Nelson Biology Test 
1. Name of test and author. Nelson Biology Test.5 Clarence H. Nelson,
Michigan State University. 
2. Nature and purposes. The general objectives of the test and the num- 
ber of test items covering each are given below: 
1. Knowledge of biological facts, concepts, and principles: Form Am, 16;
Form Bm, 12.

2. Understanding of biological facts, concepts, and principles: Form Am,
23; Form Bm, 26.

3. Ability to recognize cause-effect relationships: Form Am, 1; Form
Bm, 2.

4. Ability to interpret data and to draw sound conclusions therefrom:
Form Am, 12; Form Bm, 13.

5. Ability to recognize and to test hypotheses; to recognize and to solve
problems: Form Am, 10; Form Bm, 12.

6. Ability to evaluate critically experimental procedures and real situa-
tions having scientific implications: Form Am, 13; Form Bm, 10.


3. Grade level. Ninth or tenth grade. 

4. Number of forms. Am and Bm. 

5. Publisher and date of publication. World Book Company, 1951. 
6. Cost. $3.90 per 35. Answer sheets, $1.35 per 35. 


7. Content. The major content areas included in the test are: (a) Living 
organisms, their kinds, characteristics, and grouping; cells and protoplasm; 


5 Quotations in test description by permission of the publisher. 
adaptations. (b) Processes essential to the life of the individual, including food
manufacture and utilization; circulation and excretion; nervous and endocrine 
coordination. (c) Conservation. (d) Parasitism, disease, and health. (e) Re- 
production and heredity. (f) History of life on earth. 

The test items are based on analysis of eight widely used high school text- 
books, three yearbooks of the National Society for the Study of Education deal- 
ing in whole or in part with science education, U.S. Department of Agriculture 
bulletins, research literature on the teaching of science, news bulletins and 
circulars from biological supply houses, proceedings of symposia of the Amer- 
ican Association for the Advancement of Science, and evaluation sections of 
the volume, Science in General Education, published by the American Council 
on Education. 

The test items, most of them in multiple-choice form, are carefully con- 
structed to measure the objectives of the content areas outlined above. Each 
form contains several line drawings of biological phenomena on which questions 
are based. The total number of items is 75. 

On the whole, this test is one of the most carefully and thoughtfully con- 
structed of its type and reveals a high degree of craftsmanship and competence, 
both in content and technique. 


8. Time required. 40 minutes. 


9. Directions for, and ease of, administering. The test is practically 
self-administering. The examiner sees that each student fills in the necessary 
information about himself, and that all students understand the directions. 


10. Validity. The evidence for validity of the test is principally of two 
kinds. The first has already been referred to under No. 7, above, and is cur- 
ricular in nature. The second type of evidence offered consists of the discrimi- 
nation indexes for each item. Items that did not show satisfactory discrimi- 
nating power in the initial tryouts were not retained for use in the final forms. 
Finally, all teachers whose classes took the preliminary forms were asked to 
criticize the items with respect to coverage, clarity, appropriate difficulty, etc. 


11. Reliability. Corrected split-half reliability coefficients of .87 and .88 
were obtained with samples of biology students in two different communities. 
The correlation between the two forms administered to the same students less 
than a week apart was .77. The standard error of measurement for a single 
score is 4.3 standard score points, These reliabilities are not as satisfactory as 
would be hoped for in an instrument so highly acceptable otherwise. 
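
A "corrected" split-half coefficient is ordinarily one that has been stepped up by the Spearman-Brown formula, since the correlation between two half-tests understates the reliability of the full-length test. The sketch below assumes that this is the correction meant here, which the description does not state explicitly. It shows the formula and its inverse, so that the half-test correlations implied by the reported coefficients can be recovered. The function names are illustrative.

```python
def spearman_brown(half_test_r):
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2 * half_test_r / (1 + half_test_r)


def half_test_correlation(corrected_r):
    """Invert the Spearman-Brown formula to recover the half-test correlation."""
    return corrected_r / (2 - corrected_r)


for corrected in (0.87, 0.88):
    half = half_test_correlation(corrected)
    print(f"corrected r = {corrected:.2f} implies half-test r of about {half:.3f}")
    # Round trip: stepping the half-test value up again returns the original.
    assert abs(spearman_brown(half) - corrected) < 1e-12
```

The split-half figure describes consistency within a single sitting, whereas the correlation of .77 between forms also reflects differences between the two forms and changes over the intervening days, which is one reason the two kinds of coefficient need not agree.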


12. Manual. The manual for this test, as with others in the Evaluation and 
Adjustment Series, gives the impression of being less full and complete than it 
should be and is perhaps an intermediate step towards a better one. It is ade-
quate as a guide for administering and scoring, and gives some suggestions for
use of the test.


13. Scoring. Answer sheets must be used with the test. These may be 
either hand-scored with a cut-out stencil, or machine-scored. Hand-scoring is 
simple, and scores are easily obtained by counting the number right. 


14. Norms. There are two types of norms: normalized standard scores and 
percentile norms. The standard scores provide a means of making scores on
all the Evaluation and Adjustment Series comparable to each other and to the 
Terman-McNemar Test of Mental Ability. The percentile norms are based upon 
end-of-year testing of nearly 5,000 students in 63 schools in 27 states. 


15. Format. The test is well printed on good quality paper. The format of 
the test and accessories, except as noted, is excellent. 


Other Tests in Natural Science 


1. Anderson Chemistry Test. 1951. Grades 11, 12. End-of-course test for 
high school chemistry. Understanding of facts and concepts; understanding and 
application of functional principles; understanding and application of scientific 
method; ability to use basic skills in chemistry. 

Forms Am and Bm. $3.90 per 35. Answer sheets, $1.35 per 35. 40 minutes.

World Book Company. 


2. Cooperative Biology Test. 1948. High school classes in biology. General 
information in biological science; understanding of, and ability to use, basic 
principles; ability to interpret materials not generally encountered in textbooks. 

Multiple Forms: latest is Form Y. $2.75 per 25. 40 minutes. 

Educational Testing Service. 


3. Cooperative Chemistry Test. 1950. High school classes in chemistry.
Knowledge of fundamental concepts, terms, reactions, preparations, atomic 
structure, and chemistry related to daily living. Ability to apply knowledge of 
chemistry and to interpret scientific information. 

Multiple Forms: latest is Form Z. $2.75 per 25. 40 minutes. 

Educational Testing Service. 


4. Cooperative General Science Test. 1950. High school classes in general
science. Understanding of reasons behind familiar scientific phenomena and
processes.

Multiple Forms: latest is Form Z. $2.75 per 25. 40 minutes. 
Educational Testing Service. 


5. Cooperative Physics Test. 1949. High school classes in physics. Mechan-
ics, heat, light, sound, and electricity; numerical problems and interpretation of
diagrams. 

Multiple Forms: latest is Form Z. $2.75 per 25. 40 minutes. 

Educational Testing Service. 


6. Cooperative Science Test for Grades 7, 8, and 9. 1948. Grade 9 and superior 
pupils in Grades 7 and 8. Informational background, terms, concepts, and 
understanding, interpretation, and ability to apply ideas from scientific reading 
selections. 

Multiple Forms: latest is Form Y. $2.95 per 25. 80 minutes.

Educational Testing Service. 


7. Dunning Physics Test. 1951. Grades 11 and 12. End-of-course test for
high school physics. Knowledge and understanding of basic facts, principles, 
and laws. Mechanics, heat, sound, light, electricity, and modern physics. 
Forms Am and Bm. $3.90 per 35. Answer sheets, $1.35 per 35. 45 minutes.
World Book Company. 


8. Dvorak General Science Scales. 1942. Grade 9. Knowledge of facts, 
processes, and organization in general science. 

Forms R-1, S-2, and T-2. $1.50 per 25. 20 minutes. 

Public School Publishing Company. 


9. Interpretation of Data Test, General Education Series. 1939-40, 1950.
Lower Level, Grades 7-12; Upper Level, Grades 12-14. General accuracy, ac- 
curacy in recognition of true or false statements, accuracy with insufficient data, 
overcaution, going beyond data, crude errors; and going beyond probably true 
and probably false statements (Upper Level only). 

Forms A and B. Lower Level, $2.25 per 25; Upper Level, $2.50 per 25. 40 
minutes. 

Educational Testing Service. 


10. Read General Science Test. 1951. Grade 9. Knowledge and under- 
standing of the basic facts and principles of the physical and biological sciences 
and their applications in problem-solving situations. 

Forms Am and Bm. $3.90 per 35. Answer sheets, $1.35 per 35. 40 minutes. 

World Book Company. 

11. Test of Application of Principles in Physical Science, General Education
Series. 1940-50. High school. Exercises in application of five principles of
physical science.

One form. $2.50 per 25. 40 minutes.

Educational Testing Service.


12. Test of General Proficiency in the Field of Natural Sciences. 1950. Grade 
12 and college freshmen. Knowledge and understanding of fundamental terms 
and concepts of elementary biology, chemistry, and physics. Ability to read, 
comprehend, and use the ideas of selections of scientific material from news- 
papers, magazines, and textbooks. 

Multiple Forms: latest is Form Z. $2.75 per 25. 40 minutes.

Educational Testing Service. 


Mathematics 


Enrollments in high school courses in mathematics have shown a steady
decline, percentage-wise, during the last twenty-five or thirty years. En-
rollments in many other subjects have shown the same trend, because the
greatly enriched and expanded offerings of high schools generally
result in a smaller proportion of all high school pupils electing any one
subject. Required courses constitute exceptions to this generalization. The
decline in percentage enrollments in mathematics has been more marked,
however, than in the majority of other subjects.

In spite of this trend, one finds a relatively large number of standardized 
tests in mathematics available in published form. There are many such 
tests in algebra, and no scarcity of them in general mathematics, plane 
geometry, solid geometry, and trigonometry. One reason for this may be
that mathematics, since it is exact and objective, lends itself readily to 
objective measurement. Another reason, perhaps related to the one above, 
may be that the content of mathematics courses is more generally standard- 
ized than that in other areas, and mathematics therefore is a good field for 
the development of objective, standardized tests. Whatever the reasons, 
teachers of mathematics should have no difficulty in finding suitable meas- 
uring instruments. 

In choosing a mathematics test for illustrative purposes it was decided 
that one in the field of general mathematics would probably be most inter- 
esting and informative, since it would cover a wider range of objectives and 
content than a test in a particular subject such as algebra or trigonometry. 
As before, other standardized tests in mathematics are listed at the end of 
this section. 


Cooperative Mathematics Test for Grades 7, 8, and 9 


1. Name of test and authors. Cooperative Mathematics Test for Grades
7, 8, and 9.6 Bernice Orshansky and H. Vernon Price.


2. Nature and purposes. Part I, Skills; Part II, Facts, Terms, and Concepts;
Part III, Applications; Part IV, Appreciation. Covers basic arithmetic
processes, meanings of common quantitative terms and expressions, simple 
concepts of algebra and geometry. Purports to measure growth in the funda- 
mentals of mathematics, and to be useful for placement, guidance, curriculum 
study, and administrative surveys. 


3. Grade level. Primarily for Grade 9, but may be used with superior 
pupils in Grades 7 and 8. 


4. Number of forms. Multiple Forms: latest is Form Y. 


5. Publisher and date of publication. Educational Testing Service, 
1951. 


6. Cost. $2.95 per 25. Answer sheets, $.90 per 25. 


7. Content. Part I, Skills, consists of 45 multiple-choice items covering 
fundamental operations in arithmetic, fractions, decimals, percentage, mensura- 
tion, and square root. 


Samples: 
√16 equals
39-1 1$ 
39-2 2 
39-3 8 
39-4 4 
39-5 32 


6 Quotations in test description from Cooperative Mathematics Test for Grades 7, 8, and
9, by Bernice Orshansky and H. Vernon Price. Copyright 1948 by Educational Testing
Service. Reprinted by special permission of Educational Testing Service.
In decimal form, 5% of ½ is


33-1  .01
33-2  .025
33-3  .1
33-4  .25
33-5  2.5


Part II consists of 30 multiple-choice items measuring understanding of 
mathematical facts, terms, and concepts. 


Samples: 
How many inches are there in ½ yard?
4-1 6
4-2 12
4-3 18 
4-4 24 
4-5 36 
Which of the following is a unit in the metric system? 
7-1 ounce 
7-2 centimeter 
7-3 yard 
7-4 bushel 
7-5 gross 
Sum is to plus as difference is to 
22-1 digit 
22-2 minus 
22-3 subtract 
22-4 remainder 
22-5 quotient 


Part III consists of 30 multiple-choice items testing ability to apply
mathematical ideas in the solution of problems. Some of the questions are based on
graphs or figures.


Samples: 
If the side of a square is x + 1, the perimeter of the square is 
20-1 2x + 2
20-2 2x + 1
20-3 4x + 4
20-4 4x + 1
20-5 x + 1


If a man spends 12% of his salary on bonds, and buys a $37.50 
bond each month, what is his monthly salary? 
24-1 $312.50 
24-2 $312.60 
24-3 $350.00 
24-4 $316.20 
24-5 $450.00 
Part IV consists of 25 multiple-choice items designed to measure comprehension
of mathematical ideas and the niceties of mathematical devices and
expression. 
Samples: 
On a true-false test, a student answered 58 questions as “true”
and 64 questions as “false.” How many questions did he omit?


8-1 None
8-2 6
8-3 3
8-4 8


8-5 The number cannot be determined from the 
information given 


Which of the following has no volume? 


20-1 cylinder 

20-2 cone 

20-3 square 

20-4 cup 

20-5 rectangular box 


8. Time required. Part I, 30 minutes; Part II, 10 minutes; Part III, 30
minutes; Part IV, 10 minutes; total, 80 minutes. 


9. Directions for, and ease of, administering. The test is practically 
self-administering. After directions are given and understood, the examiner 
starts the test and at the end of 80 minutes stops the work and collects the 
papers. All items are of the multiple-choice type and answers are indicated 
on the answer sheet in the same manner throughout the whole test. 


10. Validity. As with most of the tests produced by the Cooperative 
Test Division of the Educational Testing Service, no information is given on 
the validity of this test. Consequently, one must assume that the tests have 
validity insofar as the judgment and skill of the authors can provide it. Fur- 
thermore, since validity is specific to the purposes of the user of a given test, 
the user himself will have to decide to what extent the test is valid for his par- 
ticular purpose. 

An attempt has been made in the test to incorporate measurement of functional
mathematics rather than mere knowledge and rote memory. The authors
seem to have attained some success in this objective. Whether the four parts 
of the test represent valid distinctions is, as yet, not established. It appears 
that some items could be placed just as logically in other parts as in the parts 
where they now appear. 


11. Reliability. No information is given.


12. Manual. Unlike most standardized tests, the Cooperative Tests do not
have a separate manual for each one; there is only a general manual for all of 
the tests. The fact that the tests are practically self-administering makes this 
deficiency less serious, though users might be better satisfied with a conven- 
tional manual for each test. 


13. Scoring. Answers may be marked on the test blank or on separate answer 
sheets. Test blanks are scored with strip keys by hand; answer sheets may be 
scored either by hand or by test-scoring machines. A perforated stencil costing 
$.15 is available for scoring answer sheets by either method. Each part is 
scored separately, with a correction for chance being subtracted. Raw scores 
are converted to scaled scores or percentiles. 
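
The "correction for chance" mentioned in item 13 can be made concrete with a short sketch. The description above does not state the formula used with the Cooperative Tests; the version below assumes the conventional rule of rights minus a fraction of wrongs, R - W/(k - 1), where k is the number of choices per item (five in the samples quoted). The function name and the pupil's figures are hypothetical.

```python
from fractions import Fraction


def corrected_score(right: int, wrong: int, options: int = 5) -> Fraction:
    """Assumed correction for chance: rights minus wrongs / (options - 1).

    Omitted items are neither rewarded nor penalized.
    """
    if options < 2:
        raise ValueError("an item needs at least two options")
    return right - Fraction(wrong, options - 1)


# Hypothetical pupil on Part I (45 five-choice items):
# 33 right, 8 wrong, 4 omitted.
part_one = corrected_score(33, 8)
print(part_one)  # 31
```

Under this assumed rule a pupil who guesses blindly among five choices gains, on the average, nothing from guessing, which is the purpose of the correction. The conversion of the corrected raw score to a scaled score or percentile depends on the publisher's tables and is not sketched here.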


14. Norms. Scaled scores are standard scores which, in a sense, are a type 
of norm. The basic norms, however, are percentile norms. No information is 
given on the nature, size, or geographical distribution of the normative
population.


15. Format. All the Cooperative Tests are attractively printed on good 
quality paper, and the arrangement, type, and other details of format are of 
the best. 


Other Tests in Mathematics 


1. Becker-Schrammel Plane Geometry Test. 1934. Test I, first semester, Test 
II, second semester, of plane geometry. Geometrical reasoning, computation, 
proofs, and constructions. 

Forms A and B. $1.20 per 25 of either test. 40 minutes. 

Bureau of Educational Measurements. 


2. Blyth Second-Year Algebra Test. 1951. End of second-year algebra course
in high schools. Symbolic expression, factoring, radicals, exponents, logarithms,
simple progressions, linear and quadratic equations, and graphic methods.

Forms Am and Bm. $3.05 per 35. 45 minutes. 

World Book Company. 


3. Colvin-Schrammel Algebra Test. 1937. Test I, first semester, Test II, 
second semester, of high school algebra. Test I, formulas, signed numbers,
equations, monomials, and polynomials. Test II, the same, and also simple
quadratic equations soluble by factoring, fractions, literal equations, and
simultaneous linear equations.

Forms A and B. $1.20 per 25 of Test I or Test II. 40 minutes. 

Bureau of Educational Measurements. 


4. Cooperative Algebra Test: Elementary Algebra Through Quadratics. 1950.
High school classes in elementary algebra. Covers basic knowledge, skills and 
applications of elementary algebra. Includes interpretation of graphs and 
charts. 

Multiple Forms: most recent is Form Z. $2.50 per 25. 40 minutes. 

Educational Testing Service. 


5. Cooperative General Mathematics Test for High School Classes. 1953. High
school students with three or four years of mathematics, or entering college 
freshmen. Approximately two-thirds of the test deals with algebra and plane 
geometry; the other third is almost equally divided among arithmetic, trigo- 
nometry, and solid geometry. 

Form O. $2.25 per 25. 40 minutes. 

Educational Testing Service. 
6. Cooperative Intermediate Algebra Test: Quadratics and Beyond. 1950.
High school classes in intermediate algebra. Quadratics, exponents, factoring, 
progressions, logarithms, imaginary numbers, radicals, simultaneous equations, 
graphs, right triangle (trigonometric) relationships, and proportion. 

Multiple Forms: the most recent is Form Z. $2.50 per 25. 40 minutes. 

Educational Testing Service. 


7. Cooperative Mathematics Pre-test for College Students. 1950. For stu- 
dents beginning mathematics in college. Mastery of skills in arithmetic, al- 
gebra, plane and solid geometry, and trigonometry. Most of the test deals with 
algebra. (The distribution of this test is restricted to colleges and universities.) 

Multiple Forms: most recent is Form Y. $1.75 per 25. 40 minutes. 

Educational Testing Service. 


8. Cooperative Plane Geometry Test. 1950. High school classes in plane
geometry. Knowledge of theorems concerning circles, triangles, polygons, and 
constructions; application of pertinent theorems in deductive reasoning and 
computational situations; ability to analyze the formal proofs of original prob- 
lems. 

Multiple Forms: most recent is Form Z. $2.50 per 25. 40 minutes. 

Educational Testing Service. 


9. Cooperative Solid Geometry Test. 1954. High school classes in solid geom- 
etry. Mastery of essential formulas of solid geometry and ability to apply them. 

Forms O and P. $1.75 per 25. 40 minutes. 

Educational Testing Service. 


10. Cooperative Test of General Proficiency in the Field of Mathematics. 1951. 
Grade 12 and entering college freshmen; also suitable for superior students in 
Grades 10 and 11. Basic mathematical notation and fundamental concepts in 
the fields of algebra, plane geometry, and trigonometry; measures ability to 
read and interpret information presented in graphs, tables, charts, and para- 
graphs taken from newspapers, magazines, advertising brochures, etc. 

Multiple Forms: the most recent is Form YZ. $2.75 per 25. 40 minutes. 

Educational Testing Service. 


11. Cooperative Trigonometry Test. 1950. High school and college classes in 
trigonometry. Basic definitions, formulas, computations, and applied work 
problems. 

Multiple Forms: latest is Form Y. $2.50 per 25. 40 minutes. 

Educational Testing Service. 


12. Davis Test of Functional Competence in Mathematics. 1951-52. Grades 
9-13. Consumer problems, graphs and tables, symbolism, equations, ratio, 
tolerance, etc. 

Forms Am and Bm. $3.90 per 35. Answer sheets, $1.35 per 35. 80 minutes. 

World Book Company. 


13. Foust-Schorling Test of Functional Thinking in Mathematics. 1942. 
Grades 9-12. Ability to recognize relationships, to interpret mathematical 
statements, and to express relationships in symbolic language. Computational 
facility and skills in manipulation are not measured. 
Forms A and B. $2.65 per 35. 45 minutes.
World Book Company. 


14. Functional Evaluation in Mathematics. 1952. Upper Level, Grades 7-9:
Test 4, Quantitative Understanding; Test 5, Problem Solving; Test 6, Basic 
Computation. 

Form A. Test 4, $1.95 per 25; Tests 5 and 6, $1.10 per 25. Manual, $.55.

Educational Test Bureau. 


15. Garman-Schrammel Third-Semester Algebra Test. 1934-40. Quadratic
equations, functions, ratios, proportion, and variation, radicals, exponents, 
imaginary numbers, logarithms, arithmetic and geometric progressions, and 
the binomial theorem. 

Forms A and B. $1.20 per 25. 40 minutes. 

Bureau of Educational Measurements. 


16. General Mathematics Test. 1942, 1952. Grades 7-9. Knowledge of 
essential concepts, skills, and insights in mathematics; ability in arithmetic,
algebraic and geometric concepts, applications, problem analysis, and reasoning. 

Form A. $2.50 per 25. 52 minutes. 

Acorn Publishing Company. 


17. Lane-Greene Unit Tests in Plane Geometry. Revised, 1944. High school 
plane geometry. Test 1, Fundamental Ideas of Geometry; Test 2, Parallel Lines 
and Triangles; Test 3, Rectilinear Figures; Test 4, The Circle; Test 5, Propor- 
tion and Similar Polygons; Test 6, Areas of Polygons. Available in a 32-page 
booklet only. 

Forms A and B. $3.50 per 25. Each test requires from 35 to 38 minutes. 

Bureau of Educational Research and Service. 


18. Lankton First-Year Algebra Test. 1951-52. End-of-course in first-year 
algebra. Vocabulary, meaning and use of symbols, fundamental operations, 
formulas, equations, simple algebraic fractions, radicals, ratio, proportion, vari- 
ation, graphs, trigonometric functions, and algebraic solution of problems. 

Forms Am and Bm. $3.40 per 35. Answer sheets, $1.35 per 35. 40 minutes.

World Book Company. 


19. Larson-Greene Unit Tests in First-Year Algebra. Revised, 1947. First- 
year algebra in high schools. Tests 1, 2 and 3 for first semester; Tests 4, 5 and 
6 for second semester. Available in a 24-page booklet only. 

Forms X and Y. $3.50 per 25. Each test requires from 36 to 40 minutes. 

Bureau of Educational Research and Service. 


20. Rasmussen General Mathematics Test. 1942. High school and college. 
Basic facts, principles, theories, and problems common to six reputable texts 
in this field. 

Forms A and B. $1.20 per 25. 40 minutes. 

Bureau of Educational Measurements. 


21. Rasmussen Trigonometry Test. 1940. High school and college. Basic
facts, principles, theories, and problems commonly included in elementary
textbooks.

Forms A and B. $1.20 per 25. 40 minutes.

Bureau of Educational Measurements.


22. Schrammel-Reed Solid Geometry Tests. 1950. High school and college. 
Catalog states: “a comprehensive and thoroughly objective test in solid 
geometry.” 

Forms A and B. $1.20 per 25. 50 minutes. 

Bureau of Educational Measurements. 


23. Seattle Algebra Test. 1951. End of first half-year of algebra in high 
school. Understanding of basic terms, fundamental processes with signed 
quantities, sequence of numerical operations, practical formulas, multiplication 
of binomials, solution of equations of the first degree by the rules of equality, 
solution of simple simultaneous equations, algebraic representation, and prob- 
lems. 

Forms Am and Bm. $2.85 per 35. Answer sheets, $1.35 per 35. 40 minutes.

World Book Company. 


24. Seattle Plane Geometry Test. 1951. End of first half-year of plane
geometry. Vocabulary of geometry, knowledge of simple geometric construction,
computational skills, and ability to reason from a figure. 

Forms Am and Bm. $3.40 per 35. Answer sheets, $1.35 per 35. 45 minutes.

World Book Company. 


25. Shaycoft Plane Geometry Test. 1951-52. End of one-year course in plane 
geometry. Fundamental concepts, lines and rectilinear figures, the circle, 
proportions, area of polygons, and geometric reasoning. 

Forms Am and Bm. $3.40 per 35. Answer sheets, $1.35 per 35. 40 minutes.

World Book Company. 


26. Snader General Mathematics Test. 1951-52. End of one-year course in 
general mathematics. Arithmetical concepts and processes, informal geometry, 
graphic representation, algebraic principles and skills, and numerical trigo- 
nometry. Measures not only computation and manipulation skills, but also 
application through problem situations.

Forms Am and Bm. $3.90 per 35. Answer sheets, $1.35 per 35. 40 minutes.

World Book Company. 


Learning Exercises


1. If you were to have a part in planning and constructing a standardized test
in a subject in your major field, how would you proceed? Prepare an outline of
your procedure and be prepared to explain how you would carry out each step.

2. What are the advantages and disadvantages of a state-wide testing program 
such as those in New York, Ohio, or Iowa? Can you find any surveys or studies in 
educational literature that give evidence on this question? 

3. Is the content of fifth-grade arithmetic more standardized than that of ninth- 
grade algebra? How would you find out? 

4. Examine a specimen set of an achievement test in a high school subject, studying
carefully the test, manual, and other accessories. Does it seem to you an adequate
instrument for measuring important outcomes in that subject? Give your
reasons. After you have done this, consult one of The Mental Measurements
Yearbooks to see how well you and the experts agree in your appraisal.


Annotated Bibliography


1. Buros, Oscar K. (ed.). The Fourth Mental Measurements Yearbook. Highland 
Park, N.J.: The Gryphon Press, 1953. (See also The Third Mental Measurements 
Yearbook and The Nineteen Forty Mental Measurements Yearbook.) The most complete
and useful review of measuring instruments and books on measurement in
education and psychology. Reviews give factual information as well as reviewers’ 
criticisms. Reviews in earlier editions are cited in each subsequent edition. 


2. Committees on Test Standards of the American Educational Research As- 
sociation and the National Council on Measurements Used in Education. Tech- 
nical Recommendations for Achievement Tests. Washington, D.C.: American Edu- 
cational Research Association, 1955. 36 pp. An authoritative statement con- 
cerning the kinds of information test manuals should supply. Written primarily 
with test authors and publishers in mind, though it is a useful reference for students. 


3. Greene, Harry A., Jorgensen, Albert N., and Gerberich, J. Raymond. Measurement
and Evaluation in the Secondary School, Second Edition. New York: Longmans,
Green and Company, 1954. Chapters 15-25. Approximately one-third
of the volume is devoted to a discussion of objectives and measurement procedures
in each of the common branches of instruction. Reference is made almost wholly
to published standardized tests and the problems of developing such instruments
in each subject.


4. Jordan, A. M. Measurement in Education. New York: McGraw-Hill Book 
Company, Inc., 1953. Chapters 5-13. Nine chapters dealing with measurement 
in reading, spelling, handwriting, language and literature, social sciences, foreign 
languages, mathematics, science, business education, fine arts, manual arts, and 
physical education and health, respectively. Each chapter discusses objectives 
and describes tests for elementary grades and high schools in a particular subject 
area. 


5. The Measurement of Understanding. Forty-Fifth Yearbook of the National 
Society for the Study of Education, Part I. Chicago: The University of Chicago 
Press, 1946. 338 pp. Describes and gives excerpts from many standardized tests 
in social studies, science, mathematics, language arts, fine arts, health education, 
physical education, home economics, agriculture, technical education, and industrial 
arts. Also gives many helpful suggestions for the improvement of locally made
tests in these areas.


See also catalogs of test publishers listed in Appendix B. 
The Measurement of Capacity:


Intelligence and Aptitude 


The term “capacity” as used here includes both intelligence and aptitude. 
The first part of the chapter deals with the measurement of intelligence, the 
second with the measurement of aptitude. 

Numerous attempts have been made to define intelligence, yet educators 
and psychologists have never been able to come to complete agreement on 
the term or on the concepts which it involves. Substantial progress has 
been made in the measurement of intelligence, however; such progress has 
resulted from attempts to find measures that would differentiate feeble- 
minded from normal children, or pupils successful in school work from those 
who are less successful, as judged by their teachers. Actually, our defini- 
tion of intelligence is circular, since we are in effect saying that intelligence 
is what intelligence tests measure, and that it is what makes for success in 
academic work. This is true, at least insofar as the schools are concerned. 

A second type of capacity considered in this chapter is often referred to 
as "aptitude." As used here, the meaning of the term differs from intelli- 
gence in one important respect. Aptitude refers to capacities in special 
fields, such as music, art, or mechanics. Intelligence tests have sometimes 
been referred to as tests of “scholastic aptitude,” but this is probably a 
misnomer since aptitude tests are usually much narrower and more special- 
ized. In this chapter both types of tests will be described and illustrated, 
and the theoretical bases and use of each will be discussed briefly. 


THE MEASUREMENT OF INTELLIGENCE 


For a long time psychologists have been interested in the measurement of 
intelligence, and the history of the development of intelligence tests closely 
parallels the development of psychology as a science. Moreover, intelligence 
measurement is an area of study that is not without its controversial aspects. 
Also, it must be recognized that the use of intelligence tests requires in many
instances a considerable amount of technical knowledge and training, and 
in no instance can one expect to use such tests properly without at least 
some preparation. The first part of this chapter treats these three aspects 
of intelligence testing: the background of the movement, its associated 
problems, and its current procedures. 


Historical Backgrounds of Intelligence Measurement 


Present-day intelligence tests are based on the work of a French psycholo- 
gist, Alfred Binet. Associated with him were, first, V. Henri, and later, 
Théodore Simon. There can be no doubt that Binet was the shining light 
and genius of this most important contribution to modern psychological 
methods, even though many others contributed to the developments which 
culminated in his work. 

For instance, the work of Francis Galton (1822-1911), an English scientist, 
did much to stimulate interest in individual differences and their measure- 
ment. He devised various tests of sensory discrimination involving weights, 
tones, and mental imagery, and made important contributions to the ad- 
vancement of statistical methods. He is generally regarded as having been 
the first to use standard scores and correlation. 

In the United States a number of psychologists became interested in the 
possibility of measuring intelligence. At this point it must suffice to men- 
tion only one, James McKeen Cattell, who had more to do with early devel- 
opments along these lines than anyone else in this country. Cattell studied 
psychology under the famous German, Wilhelm Wundt, who was not very 
favorably disposed toward the “new psychology” — that is, the measure- 
ment of human abilities. However, Cattell became actively interested in 
the measurement of intelligence, and when he returned to this country did 
much to stimulate interest in it. In 1896 he devised and administered to 
students at Columbia University a series of tests largely of the sensory- 
motor type. These tests measured such traits or abilities as keenness of 
vision and hearing, reaction time, mental imagery, and perception of weight, 
colors, and tones. For the most part, the tests were simple, objective, and 
easily administered. Almost without exception, however, the results 
showed little relation to teachers’ estimates of their students’ intelligence, 
or to the students’ success in school work as measured by marks or by other 
means. 

Binet also experimented with most of these tests and dozens of others with 
similarly unsatisfactory results. He and his co-workers became convinced 
that the value of such tests for measuring intelligence was extremely limited. 
Gradually, however, he developed a new approach to the problem, and began 
to make real progress. Binet based his theory on the assumption that success
would depend on the measurement of complex mental processes rather
than specific traits. Accordingly, in 1905 he published his first scale of in- 
telligence developed for the purpose of identifying subnormal children in 
the schools of Paris. This scale represented a distinct departure from previous
efforts. In the first place, Binet’s tests were arranged in order of increasing
difficulty, and thus they constituted a scale for measuring the individual’s
level of mental development. In the second place, while the
tests were of considerable variety, they were aimed to measure a complex, 
central factor in intelligence which Binet called “judgment.” Of course, 
Binet recognized that the thirty tests in the scale could not all be demon- 
strated to measure this one factor; nevertheless, the originality and purpose 
of the tests were clearly evident. Some of the items required the student to 
execute simple orders such as “Close the door,” to name objects designated
in a picture, to cite from memory the differences between pairs of familiar
objects such as wood and glass, and to construct a sentence embodying
three given words such as “Paris, gutter, fortune.”¹

This first scale was followed in 1908 by a revision in which the tests were
grouped at ages or levels, a marked improvement over the serial order of the 
original scale. In 1911 a second revision was published. This was the cul- 
mination of Binet’s work, since he died in the same year at the age of fifty- 
four. This third scale was more complete, more carefully standardized, and 
more systematically scored than either of its predecessors. 

During Binet's lifetime several psychologists, most of them Americans, 
translated and used the first and the second scales, and criticized them, 
largely to Binet's benefit. However, the basic principles were established 
by Binet, and it was clearly demonstrated that the third scale in particular 
was an instrument superior in usefulness, accuracy, and scope to anything 
that had preceded it. The new need was for further translations and adap- 
tations of the scale for use with children of different nationalities, cultures, 
and languages, and this need was soon fulfilled. 

Three Americans are most closely associated with the development of the 
Binet tests. Henry Goddard translated both the 1905 and the 1908 scales
into English and used them in his work at the Vineland, New Jersey, Train- 
ing School for Feeble-minded. In a similar way, Fred Kuhlmann used the 
early scales at an institution for feeble-minded at Faribault, Minnesota. 
He published the first translation and thoroughgoing revision of Binet's 
scales for use with American children in 1912. This edition has been re- 
vised several times subsequently, and it is still the standard scale in certain 
sections of this country. 

The major work in adapting the Binet scales for use with English-speaking
subjects was done by Lewis M. Terman. He had carried on some experimentation
with tests independently, but had not gone very far when
the Binet scales were developed. He set to work at once on these and in
1916 published what became the most widely used and accepted intelligence
test, known as the Stanford Revision of the Binet Scale. This was a careful
translation and revision, involving a complete standardization on American
children and adults. There were tests at Years III, IV, V, VI, VII, VIII,
IX, X, XII, XIV, and for levels called Average Adult and Superior Adult.
Terman rearranged many of the tests from the positions established by
Binet, added new tests, especially at the upper end of the scale, and eliminated
others. This became the standard instrument for the measurement
of intelligence in the United States and other English-speaking countries
for more than twenty years.


1 See Joseph Peterson, Early Conceptions and Tests of Intelligence (Yonkers-on-Hudson,
N.Y.: World Book Company, 1925), pp. 172-74.

In time, certain shortcomings and weaknesses in Terman's scale became 
apparent, and in 1937 he and Maud R. Merrill published their revision of 
the Stanford-Binet Scale which remedied most, if not all, of these faults. 
The chief faults were a lack of equivalent forms, gaps in the scale, notably 
at Years XI and XIII, and incompleteness at both ends of the scale. The 
1937 Revision appeared in two equivalent forms; it provided tests for the 
missing years, and it extended from age two, thus providing a far more 
thorough testing below age six, to a much higher level, Superior Adult III.
Also, the normative population was more adequate and more carefully
selected than in the earlier scale.² A few samples from Form L of the Revised
Stanford-Binet Tests of Intelligence will serve to show its nature.


Year III-6. Drawing Designs: Cross 


Procedure: Give the child a pencil and as you draw a cross making 
diagonal lines about two inches in length (X) say to him, “You
make one just like this.” Illustrate once only. Give one trial. 


Score: "Requirement is that child shall make two lines that 
cross each other. We disregard the angle of crossing and the 
straightness and length of the lines." 


Year VIII. Comprehension IV 


Procedure: Ask 
(a) “What makes a sailboat move?" 
(b) “What should you say when you are in a strange city and 
someone asks you how to find a certain address?” 
(c) “What should you do if you found on the streets of a city 
a three-year-old baby that was lost from its parents?” 


Score: 2 plus. Sample answers are given. In (a) for instance, 
“wind,” “wind and water,” “wind and sails,” are acceptable; 
“water,” “the motor" are not. 


2 Lewis M. Terman and Maud R. Merrill, Measuring Intelligence (Boston: Houghton
Mifflin Company, 1937). Quoted by permission of authors and publisher. 
Superior Adult III. Repeating 9 Digits


Procedure: Say, “I am going to say some numbers and when I 
am through I want you to say them just the way I do. Listen
carefully, and get them just right." Before each series repeat, 
“Listen carefully, and get them just right.” Rate, one per 
second. Avoid accent and rhythm. 
(a) 3 — 7 — 1, etc. 
(b) 7 — 3 — 9, etc. 
(c) 8—5— 2, etc. 
Score: 1 plus. The series must be repeated in correct order without
error after a single reading.


The administration of this scale requires training, practice, and skill. 
No one should attempt it except as a learner under expert supervision until 
he has all three requisites. The examiner should be so thoroughly familiar 
with the procedure that he can give major attention to the presentation of 
the tasks, recording of responses, and legitimate encouragement of the sub- 
ject. Assuming that the user has attained satisfactory skill and has estab- 
lished rapport with the subject, the procedure for administering the scale 
is as follows: 


A. Establish the basal mental age. This is the highest level on the scale 
at which the subject passes all the tests. 

B. Proceed to give all the tests at successively higher levels, recording 
successes and failures at each level. 

C. Stop at the level at which the subject fails all the tests. 

D. Calculate the intelligence quotient. 

The procedure may be illustrated with the example of Mary, aged eight 
years and six months. The examiner starts at the seven-year level at which 
he finds she passes all the tests; she does likewise at eight; at the nine-year 
level she passes 5 out of 6; at ten, 3 out of 6; at eleven, 2 out of 6; at twelve, 
1 out of 6; and none at the thirteen-year level. To summarize: 

Mary, Age 8-6
Basal M.A., 8 years

Nine year level      passes 5 × 2 = 10 months ³
Ten year level       passes 3 × 2 =  6 months
Eleven year level    passes 2 × 2 =  4 months
Twelve year level    passes 1 × 2 =  2 months
                                    22 months

Mental Age: 8 years + 22 mos. = 9-10

I.Q. = (M.A. / C.A.) × 100 = (9-10 / 8-6) × 100 = (118 / 102) × 100 = 116


3 Since there are six tests at each year level at this part of the scale, each test is given
a weight of two months, that is, 12 ÷ 6. At earlier levels there are six tests every half
year. These count for one month each. At higher levels tests have a weight of three
or more months each.
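
The arithmetic of steps A through D can also be sketched in a few lines of Python. This is an illustration only, not part of the published scoring procedure: the function names are hypothetical, the two-month weight per test holds only where there are six tests at each year level (see footnote 3), and rounding the quotient to the nearest whole number is an assumption that agrees with the figure of 116 obtained for Mary.

```python
def to_months(years: int, months: int = 0) -> int:
    """Express an age given in years and months as months only."""
    return years * 12 + months


def mental_age_months(basal_years: int, passes_above_basal: list[int],
                      months_per_test: int = 2) -> int:
    """Basal mental age plus credit for each test passed above the basal level.

    months_per_test = 2 assumes six tests per year level (12 / 6).
    """
    return to_months(basal_years) + months_per_test * sum(passes_above_basal)


def intelligence_quotient(mental_months: int, chronological_months: int) -> int:
    """I.Q. = (M.A. / C.A.) x 100, rounded to the nearest whole number."""
    return round(100 * mental_months / chronological_months)


# Mary, aged 8-6: basal age 8; passes 5, 3, 2, 1, and 0 tests
# at the nine- through thirteen-year levels.
mary_ma = mental_age_months(8, [5, 3, 2, 1, 0])
mary_iq = intelligence_quotient(mary_ma, to_months(8, 6))
print(divmod(mary_ma, 12), mary_iq)  # (9, 10) 116
```

Both ages are reduced to months before dividing, which is why 9-10 and 8-6 appear as 118 and 102 in the computation above.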
The tests are placed at age levels such that the standardization population
yields an I.Q. of very close to 100 at each level. Assuming the sample to 
be representative of the total population, this is as it should be. Earlier, 
Binet and others had placed tests at age levels where approximately three- 
fourths of the standardization population of a given age passed the test. 
This method was not entirely satisfactory and has been superseded by the 
method mentioned above. Under ideal circumstances, however, the two 
methods yield about the same results. 

Although the foregoing is an extremely limited presentation of the pro- 
cedure with the scale, it should suffice to give the student an idea of how the 
scale is used. 

The Binet scales and their counterparts in other countries had one notable 
disadvantage which soon became evident. They were individual scales, 
which meant that only one person could be tested at a time by a trained 
examiner. This is a time-consuming procedure, although a quite satis- 
factory one with regard to thoroughness, rapport between subject and 
examiner, and opportunity to observe the subject while being tested. How- 
ever, some American psychologists, notably Arthur S. Otis of The World 
Book Company, W. S. Miller of the University of Minnesota, Rudolf Pint- 
ner and E. L. Thorndike of Columbia University, and Terman himself, soon 
began experimenting with the adaptation of certain types of tests to group 
testing. In 1917, the entry of the United States into World War I gave 
this movement the impetus it required. Large numbers of men were being 
inducted into military service and the need for and potential usefulness of 
some kind of rapid, accurate, mental measurement of men soon became appar- 
ent. The government asked a number of psychologists to develop something for 
this purpose. The result was Army Alpha, the first single, unified group test 
of intelligence. A counterpart, Army Beta, was also constructed for use with 
illiterates and those who could not read well enough for Army Alpha, which 
presupposed about sixth-grade reading ability. Nearly two million men 
were tested with Alpha during the war, and in subsequent years it was widely 
used in schools and colleges. 

Soon after Army Alpha and Beta were produced, many group tests were 
published, some very closely patterned after the army tests, and none differ- 
ing materially from them. Alpha had been designed for use with adults; 
most of the new tests were made for use with children and adolescents. 

Although Army Alpha is rarely used today, it will be interesting and re- 
vealing to examine it a little more closely. It consists of eight sub-tests or 
parts, each closely timed. The entire test takes about forty minutes to
administer. Some samples follow:

Army Alpha, Form 6


Test 3. This is a test of common sense. Below are sixteen 
questions. Three answers are given to each question. You 
are to look at the answers carefully; then make a cross in the 
square before the best answer to each question, as in the sample: 


Sample: Why do we use stoves? Because 
[ ] they look well 
[X] they keep us warm 
[ ] they are black 


Here the second answer is the best and is marked with a cross. 
Begin with No. 1 and keep on until time is called. Time: 1½ 
minutes. 


Test 7. 
Sample: sky — blue : : grass — table green warm big 
fish — swims : : man — paper time walks girl 
day — night : : white — red black clear pure 


In each of the lines above, the first two words are related to 
each other in some way. What you are to do in each line is to 
see what the relation is between the first two words, and under- 
line the word in heavy type that is related in the same way to 
the third word. Begin with No. 1 and mark as many sets as you 
can before time is called. Time: 3 minutes (for 40 analogies). 


The other tests in Army Alpha are Following Directions, Arithmetic, 
Opposites, Scrambled or Disarranged Sentences, Number Series Comple- 
tion, and General Information. The items in each test are arranged in order 
of difficulty, closely timed, and speeded — that is, speed is strongly empha- 
sized. 

Army Beta was designed as a counterpart to Army Alpha, but does not 
require ability to read. It can be given without spoken directions, so it is 
strictly a non-verbal examination. It has the same number of parts as 
Alpha and was designed to parallel in purpose, but pictorially, most of the 
parts or sub-tests of Alpha. Army Beta was not nearly so extensively used 
as Alpha, either during the war or subsequently. 

Although hundreds of group intelligence tests have been developed since 
1917, those which have been successful do not differ in any fundamental 
respect from Army Alpha. This is not a criticism of newer tests, but rather is 
a recognition of the basic soundness of this first venture, which, of course, 
was essentially an adaptation of Binet’s ideas to group testing. The
importance of Binet’s contribution to the development of mental tests is 
obvious. Undoubtedly a genius, he established the basic principles and
devised the appropriate techniques which have been followed, without material 
change, to this day. 
1. Besides those names mentioned in the brief historical sketch just given, others 
have been prominently identified with the early days of this movement. Find five 
of these and give in a sentence or two the chief contribution of each. (Do not 
confine yourself to Americans.) 

2. Binet died in 1911 at the age of fifty-four. Look up his biography and make 
an outline of the main events of his professional life and his major contributions to 
psychology. 

3. Examine a copy of Army Beta along with one of Army Alpha. See if you can 
pair the tests in the former with their counterparts in the latter. How would you 
set up an experiment to determine the extent to which this duplication of function 
was accomplished? 


Basic Concepts and Principles of Intelligence Measurement 


Before discussing current procedures in measuring intelligence we should 
consider some of the fundamental ideas and associated problems of this 
field. Most of these problems are theoretical and do not affect to any great 
extent the actual use of the tests. Nevertheless, it is important for the 
user to understand this theory so that he can be more realistic in his ap- 
proach and more aware of the strengths and limitations of the instruments 
he uses. In instances where the topics discussed are controversial issues, we 
have tried to present some of the facts on both sides. 


Theories of Intelligence 

Binet, like many in his field, revised his thinking as he gained more experi- 
ence and witnessed the results of his tests. It will be recalled that at the 
time of the publication of his first scale in 1905 he believed “sound
judgment” to be the central factor. This was in sharp contrast to the views held 
by most of his predecessors and contemporaries, who believed that intelli- 
gence consists of a large number of specifics. The tests of reaction time, 
acuity of vision, hearing, and so on, were consistent with this latter view, 
but they did not prove very useful in distinguishing between feeble-minded 
and normal children, or in identifying those who were considered brighter 
by their teachers, or those who did relatively well in school or college. 

Binet’s definition of intelligence included three factors or capacities:
ability in thinking (1) to maintain a definite direction or to “stay on the track,” 
(2) to choose appropriate means to ends, or to adapt procedures to goals 
sought, and (3) to objectively evaluate one’s own actions (autocriticism). 
In general, Binet held that the mind is unitary and possesses one overriding 
function which he considered to be effective adjustment to environment. 

More precisely defined theories of intelligence have been advanced by 
various people contemporary with and following Binet. The first of these 
is the two-factor theory developed by Charles Spearman, an English
statistician. On the basis of correlation studies, he proposed a theory that
intelligence consists of two factors: a general factor, g, and many specific factors, 
s₁, s₂, etc. The g enters into all intellectual activities, while each particular 
activity is also subject to one or more specific factors. 

In contrast with the two-factor theory is one which has been referred to as 
the multi-factor theory, the beginnings of which are generally associated with 
the name of E. L. Thorndike.⁴ His theory holds that intelligence consists of 
a very large number of specific factors or functions (s₁ + s₂ + s₃ + ... + sₙ) 
and casts doubt on the existence of any g, or general factor. This is
consistent with Thorndike’s theory of learning called connectionism, which holds 
that learning consists of forming connections or bonds between specific 
stimuli and responses and that a person’s learning increases with the 
number of connections formed. Consequently, his degree of intelligence 
would be determined by the ease with which bonds are formed and by their 
strength and number. 

A third theory, somewhere between these two, is called the group-factor 
theory, usually associated with the name of L. L. Thurstone, an American 
psychologist and statistician. According to this theory, intelligence con- 
sists neither of g and s, nor of many s's, but of six to ten primary or group 
factors. The six named by Thurstone are number, verbal, space, word 
fluency, reasoning, and rote memory. While none of these factors is com- 
pletely distinct from the others, statistical evidence has been presented to 
justify the assumption that they are not the same. The Thurstone theory 
is, in a sense, a compromise or middle ground between the other two. 

All three of the theories discussed are based on statistical foundations — 
specifically, on intercorrelations between different mental tests. The
theories, while based on the same kinds of data, illustrate the inevitable fact 
that different individuals will occasionally interpret the same or similar data 
differently. Present-day thinking leans toward the group-factor theory 
as being probably most consistent with the facts, but recognizes the proba- 
bility of the existence of a general factor also. 


The Intelligence Quotient 


Apparently Binet did not come to the concept of the I.Q. himself, but he 
clearly saw and expressed the mental-age concept at the time of the publica- 
tion of his first scale, and the idea emerged clearly with the 1908 revision. 
A German psychologist, Wilhelm Stern, seems to have been the first to 
formulate the concept of the intelligence quotient, in 1912, a year after 
Binet’s death. The intelligence quotient, or I.Q., as it is generally known, 


4 E. L. Thorndike, The Measurement of Intelligence (New York: Teachers College, 
Columbia University, 1927). 
was quickly adopted by Terman and others, and today is the generally
accepted means of expressing intelligence, at least of persons below adulthood. 

The I.Q., as has been shown already, is the ratio of mental age or level to 
chronological or life age. Thus, it is reasoned that a child developing nor- 
mally should have a mental age equivalent to his chronological age and 
therefore an I.Q. of 1, or, as it is commonly expressed, 100. The mental age 
or level for any given chronological age up to maturity is determined by test- 
ing a representative sample of children of that age. More specifically, for an 
intelligence test the mental age of nine, for example, is determined by giving 
that test to a representative sample of nine-year-old children. The average 
score in a group test made by these nine-year-olds is the mental-age norm 
for that life age. Subsequently, any child who makes that score is said to 
have a mental age of nine. On the other hand, this same score (and mental 
age) may be earned by children of varying life or chronological ages. Thus, 
John who is twelve may earn a score typical of nine-year-olds; therefore, his 
I.Q. is 9/12 × 100 = 75. Again, Jean may be ten and if her mental age is 
nine, her I.Q. is 90; still another child with an M.A. of nine may be only 
seven, in which case his I.Q. is about 130. Thus, the I.Q. represents the 
ratio of mental age to chronological age, or, put in another way, the rate of 
mental development compared to age. In the first instance above, there is, 
on the average, three-quarters of a year of mental growth per year of life; 
in the second, nine-tenths of a year; and in the third, 1.3 years mental gain 
for every year lived. The same facts may be shown in a different way by 
comparing children of the same age who have different mental ages. 
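
The chain from raw score to mental age to I.Q. can be illustrated with a short Python sketch. The norms table below is entirely hypothetical (no actual test has these values), and the linear interpolation between adjacent ages is an assumption made only to keep the example small.

```python
# Hypothetical age norms: the average raw score earned by a representative
# sample of children at each chronological age (invented values).
NORMS = {7: 30, 8: 38, 9: 46, 10: 54, 11: 62, 12: 70}


def mental_age(score, norms=NORMS):
    """Return the age (in years) whose norm matches the score.

    Scores between two norms are interpolated linearly; scores outside the
    table are clamped to its ends, which is a simplification.
    """
    points = sorted(norms.items(), key=lambda p: p[1])  # (age, norm score)
    if score <= points[0][1]:
        return float(points[0][0])
    if score >= points[-1][1]:
        return float(points[-1][0])
    for (age_lo, s_lo), (age_hi, s_hi) in zip(points, points[1:]):
        if s_lo <= score <= s_hi:
            return age_lo + (age_hi - age_lo) * (score - s_lo) / (s_hi - s_lo)


def ratio_iq(ma_years, ca_years):
    """Ratio I.Q.: mental age over chronological age, times 100, rounded."""
    return round(100 * ma_years / ca_years)


# The same score (46, the nine-year norm) earned by children of three ages.
ma = mental_age(46)
for name, ca in [("John", 12), ("Jean", 10), ("third child", 7)]:
    print(name, ratio_iq(ma, ca))  # John 75, Jean 90, third child 129
```

The third value, 129, is the "about 130" of the text.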

The I.Q. concept is under fire from various quarters for a number of rea- 
sons. For example, it does not have the unqualified approval of educational 
statisticians because it does not have the same meaning at all levels or points 
on the scale. An I.Q. of 120 may represent the ratio of 6/5 in a five-year-old 
and the ratio of 12/10 in a ten-year-old. Obviously, in the first case the 
difference between M.A. and C.A. is one year, in the second, two years. 
However, this disadvantage is inherent in any ratio and applies to the I.Q. 
to no greater extent than it does to any other. 

Another criticism of the I.Q. concept is that I.Q.’s are not comparable 
from one test to another. This was early pointed out by W. S. Miller,⁵ 
who gave ten different intelligence tests to a group of 57 ninth-grade pupils 
and found that there was a range of more than three years in the mean 
M.A.’s on the ten tests. Thus a child might have an I.Q. of 116 on one of 
the tests and 130 on another. Miller showed that this was largely due to 
differences in the tests themselves and differences in the standardization 
population. He suggested a method of equating such differences by
converting the I.Q.’s to standard scores, as described in Chapter 3. 


5 W. S. Miller, “Variation and Significance of Intelligence Quotients Obtained from 
Group Tests,” Journal of Educational Psychology, 15: 359-66 (September, 1924);
“Variation of I.Q.'s Obtained from Group Tests,” 24: 468-74 (September, 1933). 
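
To see why standard scores make results from different tests comparable, consider the following minimal Python sketch. The I.Q. values are invented, and the second test is assumed, purely for simplicity, to run exactly 14 points higher than the first for every pupil; the sketch is not a reproduction of Miller's own procedure.

```python
from statistics import mean, pstdev


def to_standard_scores(values):
    """Convert scores to z-scores: deviation from the group mean in SD units."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]


# Hypothetical I.Q.'s for the same five pupils on two different group tests.
test_a = [100, 110, 116, 120, 104]
test_b = [114, 124, 130, 134, 118]  # assumed to run 14 points higher throughout

z_a = to_standard_scores(test_a)
z_b = to_standard_scores(test_b)

# The third pupil has I.Q. 116 on one test and 130 on the other,
# yet holds the same relative position in the group on both.
print(round(z_a[2], 2), round(z_b[2], 2))
```

Real tests differ in spread as well as in level, so the two sets of standard scores would agree only approximately in practice.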

Various proposals have been made, ranging from abandonment of the 
I.Q. entirely to replacement with a more adequate measure. In the latter 
case, the suggestions are generally in the form of a deviation score of 
intelligence such as a standard score. While these overcome some of the
disadvantages of the I.Q., the deviation scores themselves have disadvantages 
which are not easily remedied. One difficulty is that few existing intelligence 
tests are standardized in this way. Second, most users of the tests have only 
recently become conversant with the I.Q. concept and with the methods of 
determining I.Q.'s from particular tests, and they are not familiar enough 
with other kinds of measures, such as standard scores, to feel secure in using 
them. It is likely that such scores will eventually have wide acceptance 
and use, but it also seems probable that the I.Q. as a measure of intelligence 
will remain in use for many years, largely because most individual scales 
based on the Binet scale have been standardized to yield I.Q.'s and nothing 
else. 

One other aspect of the I.Q. concept, also in the nature of a limitation, 
should be mentioned here. Experience has shown that the functions or 
capacities measured by existing intelligence tests reach their maximum 
about the same time as physical maturity is reached, which seems to be 
somewhere between the ages of fifteen and twenty, as reported in various 
studies. In some respects this would appear to be quite consistent, since 
we do not expect an individual to grow taller or to acquire more teeth or 
stronger eyesight once he is mature. On the other hand, most individuals 
like to think of themselves as gaining in wisdom and stature as long as they 
live, or at least until they grow quite old. Thus, as far as the I.Q. goes, if we 
continue to use M.A./C.A. and the individual's M.A. ceases to increase after a 
certain age, the I.Q. value declines. To illustrate, if M.A./C.A. were 16/16 the 
I.Q. would be 100. If the M.A. was still 16 when the C.A. was 20, the ratio 
then would be .80, or, I.Q. = 80. This, of course, is an obviously erroneous 
conclusion, and certainly a demoralizing one. Consequently, it is customary 
with subjects past the age of sixteen not to calculate I.Q.’s at all, but to 
express the individual's position relative to others in his group, either by 
percentile ranks or by standard scores. This is the procedure followed
almost entirely with adult groups such as college and university students, 
men and women in military service, and so on. In effect, this procedure
results in an equivalent of the I.Q., since we can assume that these persons 
have reached maturity. Therefore, the denominator of the M.A./C.A. ratio 
is a constant factor and the score or percentile rank is really a measure
comparable to the I.Q. rather than simply a measure of rank or relative position. 
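
The decline of the ratio I.Q. after maturity, and the percentile-rank alternative, can both be shown in a brief Python sketch. The group scores are invented for illustration, and the percentile-rank formula used here (counting half of the tied scores as falling below) is one common convention assumed for the example, not one prescribed in the text.

```python
def ratio_iq(ma_years, ca_years):
    """Ratio I.Q.: mental age over chronological age, times 100, rounded."""
    return round(100 * ma_years / ca_years)


def percentile_rank(score, group_scores):
    """Per cent of the group scoring below, counting half of any ties."""
    below = sum(s < score for s in group_scores)
    equal = sum(s == score for s in group_scores)
    return 100 * (below + 0.5 * equal) / len(group_scores)


# Mental age stays at 16 while chronological age advances from 16 to 20.
print([ratio_iq(16, ca) for ca in range(16, 21)])  # [100, 94, 89, 84, 80]

# Hypothetical raw scores for a group of eight adults on one test.
GROUP = [82, 95, 101, 110, 110, 123, 130, 141]
print(percentile_rank(110, GROUP))  # 50.0
```

The ratio falls steadily although nothing about the person has changed, whereas the percentile rank depends only on standing within the group.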


Constancy of the I.Q. 


One of the continuing points of issue about the I.Q. concept is the
question of constancy. Usually this means constancy from time to time in the 
development of the individual, though it may also mean constancy at
different levels or points on the scale. The latter has been touched upon in the 
preceding section as a function of the ratio I.Q. and is a technical matter 
beyond the scope of this book. The question of constancy in an individual, 
however, is one of direct concern to every user of intelligence tests. In
essence, this question is, “If I test a child today and he has an I.Q. of 90, 
what is the result likely to be if I test him again a year or more from now?” 
This is probably what most persons have in mind when they discuss the 
constancy of the I.Q. 

We have already noted the fact that I.Q.'s obtained by use of different 
tests differ even when such tests are given within a few days of each other. 
Such differences have been shown to be, in large part, inherent in the tests 
themselves. But let us assume the same test has been given to the same 
children at intervals. What do such experiments show about constancy? 

Numerous studies with the Stanford-Binet administered to large numbers 
of cases at intervals up to several years have shown an average change of 
about five points or less, and average correlations between first and second 
testings of about .85 under ordinary conditions. On the other hand, some 
studies reveal that rather substantial changes upward can be brought about 
when children are transferred from an impoverished environment, such as 
an orphanage, to a good home with rich cultural advantages. In such cases, 
changes on the positive side of as many as 30 points of I.Q. have been
reported. 

Taken altogether, the evidence can be summed up as follows: for the 
great majority of children, the I.Q.’s based on a good test of intelligence 
properly administered and accurately scored will be relatively accurate and 
stable. Upon retesting under similar conditions with the same test, half 
of them will change five points or less, more than 80 per cent of them by 
not more than ten points, and 95 per cent by fifteen points or less. Perhaps 
5 per cent of the children will show changes of more than fifteen points. In 
other words, the probable error of a single I.Q. test has been found on the 
Binet scale to be approximately five points. Approximately the same
probable error would be present in a good group test of intelligence, though the 
variability might in some cases be slightly greater. In the second place, it 
has been reliably reported that substantial changes in tested I.Q. can be 
produced by radical improvements in environment. Although these 
changes and improvements are greater than most children are likely to 
experience, the possibility must not be overlooked. 
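
The percentages quoted above follow from the probable error if retest changes are assumed to be normally distributed about zero. That assumption is made here only for illustration. The Python sketch below uses the standard library's `statistics.NormalDist`; the probable error is by definition the deviation that half of the cases fall within.

```python
from statistics import NormalDist

PROBABLE_ERROR = 5.0  # I.Q. points, as quoted for the Binet scale

# One probable error corresponds to the 75th percentile of the unit normal
# curve (about 0.6745 standard deviations).
z_half = NormalDist().inv_cdf(0.75)
sigma = PROBABLE_ERROR / z_half

changes = NormalDist(mu=0.0, sigma=sigma)


def share_within(points):
    """Proportion of retest changes no larger than `points` in either direction."""
    return changes.cdf(points) - changes.cdf(-points)


for points in (5, 10, 15):
    print(points, round(100 * share_within(points), 1))
```

Under this assumption about one half of the changes fall within five points, a little over 80 per cent within ten, and roughly 95 per cent within fifteen, in line with the figures in the text.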

It should be emphasized that a single test, either individual or group, 
constitutes only a very limited sampling of a child's behavior. Because of 
this fact, the test should always be regarded as provisional, and should be 
supplemented by additional tests whenever possible. It should be pointed 
out in this connection that many schools include in their testing programs 
at least two and often three to five testings of intelligence between Grades 
1 and 12. It should also be mentioned that I.Q.'s obtained between 
the ages of six and sixteen are much more stable, generally speaking, than 
those obtained with children below six. Younger children tend to be diffi- 
cult to control and more restless since their attention span is shorter; for 
this reason tests given to younger children produce less reliable results 
than those given to older children. This strongly suggests the desirability 
and even the necessity for further testings when some of the tests have 
been given below the age of six. 


Distribution of Intelligence Quotients 


Since there is no absolute standard of intelligence or of mental age, the 
basis for these must be relative. In other words, a mental age of twelve 
years is determined by what a sampling of children representative of the 
total population of twelve-year-olds can do on a given test. As has been 
pointed out, this varies from one test to another because of differences in 
the tests themselves and differences in the population sample. Nevertheless, 
distributions both of raw scores and I.Q.'s obtained from adequate
population samples quite generally show results closely approximating the 
normal curve. One such distribution for the Revised Stanford-Binet is shown 
in Figure 7. This is based on a composite I.Q. on Forms L and M of the 
1937 Revision for 2,904 subjects, ages two to eighteen, inclusive. 

The close approximation to the normal curve in the form of the
distribution of I.Q.’s is of considerable importance. If I.Q.’s obtained from
intelligence tests did not yield this kind of distribution from adequate
population samples, there might be a basis for doubting the validity of such tests, 
or at least the validity of the particular ones used. The basic assumption 
is that intelligence is normally distributed in the general population, and 
tests which do not yield results at least approximating the normal might be 
open to serious question. Also, this assumption provides a sound theoretical 


6 Claude L. Nemzek, “The Constancy of the I.Q.,” Psychological Bulletin, 30: 154 
(February, 1933). 
Figure 7 
Distributions of Composite L-M I.Q.’s of Standardization Group, Ages 2 to 18 

[Histogram: per cent of cases (vertical axis) plotted against I.Q. (horizontal
axis) in ten-point intervals from 35-44 to 165-174.]

(From Lewis M. Terman and Maud A. Merrill, Measuring Intelligence, Boston: Houghton 
Mifflin Company, 1937. Reproduced by permission of the authors and publisher.) 


basis for interpreting the significance of any given I.Q. in comparison with 
the proportion of the general population that has the same I.Q. 

When the normal distribution of I.Q.’s is broken up according to the 
areas under the curve, and a mean of 100 and a standard deviation of 16 are 
used as the constant values (based on Stanford-Binet I.Q. distributions), 
the proportions of various levels of I.Q. are as shown in Table VII, page 250. 

The classification of various I.Q. levels as shown in this table is generally 
accepted, though the limits are not to be regarded as exact. At the lower 
end of the distribution the mentally defective are classed as morons, im- 
beciles, and idiots, in descending order; those whose I.Q.'s are below the 
lowest limit in the table are usually classed as idiots. 
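
The areas under a normal curve with a mean of 100 and a standard deviation of 16 can be computed directly. The Python sketch below is a theoretical illustration only: it treats each ten-point band as a continuous interval (for example, 110 up to but not including 120), and its output should not be taken as a reproduction of the published percentages in Table VII.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)

EDGES = list(range(30, 150, 10))  # 30, 40, ..., 140


def band_proportions():
    """Return (label, proportion) pairs for ten-point I.Q. bands."""
    rows = [("below 30", iq.cdf(30))]
    for lo, hi in zip(EDGES, EDGES[1:]):
        rows.append((f"{lo}-{hi - 1}", iq.cdf(hi) - iq.cdf(lo)))
    rows.append(("140 and above", 1 - iq.cdf(140)))
    return rows


for label, share in reversed(band_proportions()):
    print(f"{label:>14}  {100 * share:5.1f} per cent")
```

Because the curve is symmetrical about 100, the bands 100-109 and 90-99 receive equal shares, as do each other pair of bands equally distant from the mean.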


Heredity versus Environment 

Closely related to the question of the constancy of the I.Q. is the question 
of the relative influence of heredity and environment. This controversy 
has raged for years with equally strong support on both sides of the argu- 
ment. To a considerable degree the argument has been implemented and 
intensified by the development of intelligence tests. The widespread appli- 
cation of Army Alpha and its successors has supplied immense amounts of 
data, making possible comparisons among different racial, socio-economic, 
and cultural groups or levels. Implicit in such comparisons, of course, is 
always the question of to what extent the differences among these groups 
are due to heredity or environment, or both. 
Table VII
Classification and Proportions of Various Levels of I.Q. in the General 
Population 

Classification            I.Q.              Percentage in General Population

Very superior             140 and above
Superior                  130-139
                          120-129
High average              110-119
Normal average            100-109
                          90-99
Low average               80-89
Borderline defective      70-79
Mentally defective        60-69
                          50-59
                          40-49
                          30-39

(Adapted from Maud A. Merrill, “Significance of I.Q.’s on the Revised Stanford-Binet 
Scales,” Journal of Educational Psychology, 29: 641-51, December, 1938.) 

The debate on the relative influences of heredity and environment, while 
often apparently useless, is nevertheless concerned with a question of fun- 
damental importance to educators and psychologists. Obviously, if a 
child’s capacity for accomplishment were determined solely by inherited 
traits or abilities, education necessarily would have quite a different outlook 
and philosophy than it would in a less deterministic frame of reference. 

Many investigations and experiments have been made on this question 
from the time of Galton to the present. One method of investigation in- 
volves studies of persons who are obviously brilliant, feebleminded, or de- 
generate, to determine whether such traits tend to run in families. Most 
of these studies show that they do. Even with this knowledge, however, 
the question of whether such characteristics are due more to inheritance 
or to environment is not satisfactorily answered. 

Another type of investigation involves the transplanting of children 
from a poor environment to a good one. The results of such studies are 
equivocal: some children show general and marked improvement in intelli- 
gence and other traits, others do not. One is almost forced to conclude, 
therefore, that investigators tend to find the results they look for; much of 
the research showing marked changes has been severely criticized for poor 
control, careless procedures, or bad statistical treatment of data. 

A third type of investigation uses twins as subjects. Identical twins are 
as nearly alike in heredity and environment as two humans can be. Start- 
ing with a large number of pairs of twins, the investigator studies many 
aspects of their resemblances — physical, mental, and personality. Most of 
these studies show remarkable and, in some instances, almost fantastic
similarities among pairs of identical twins. In the few cases where studies 
have been made of identical twins reared in different environments, the 
resemblances have been less close, especially in mental and social traits. 

In summary, it may be said that the heredity-environment issue is far 
from settled. During the last hundred years there appears to have been a 
gradual shift from a viewpoint strongly hereditarian to one more favorable 
to the environmentalist point of view. Of course there are extremists who 
would rule out entirely one or the other point of view, but most educators 
and psychologists, believing that what the individual becomes is the result 
of an interaction of heredity and environment, take a middle-of-the-road 
view on this question. As one writer has put it, “The great mathematician, 
Sir Isaac Newton, if he had been brought up among African bushmen would 
probably have become a remarkable bushman but he would never have dis- 
covered the laws of motion.” Similar statements might be made about a 
Mozart, an Einstein, or an Edison. In every instance, environmental op- 
portunities were necessary to bring out a remarkable native endowment. 
On the other hand, it is improbable that any conceivable environment could 
make an Einstein, a Mozart, a Newton, or an Edison of a child having no 
special talents or unusual endowments. 

Recent studies⁷ suggest that cultural factors may enter into performance 
on intelligence tests more than had been supposed. Indeed, it is maintained 
that existing group tests penalize children who come from poor 
homes and whose cultural patterns, parental attitudes, and group standards 
are not the middle-class ones which are said to dominate school and test- 
ing situations. Experimentation and studies are now being carried on to 
investigate this hypothesis further. 


Learning Exercises 


4. Solve the following using the formula I.Q. = (M.A. / C.A.) × 100: 
M.A. = 1; C.A. = 10          I.Q. = 
M.A. = 12; I.Q. = 125        C.A. = 
C.A. = 10; I.Q. = 90         M.A. = 
M.A. = 10-8; C.A. = 9-4      I.Q. = 

5. Using the following data, calculate the proportion of the total population with 
I.Q.’s above 116, below 84. 
M = 100     σ = 16 
(Note: Refer to Chapter 3 and Appendix A.) 


7 Kenneth W. Eells, Allison Davis, Robert J. Havighurst, Virgil E. Herrick, and Ralph 
Tyler, Intelligence and Cultural Differences (Chicago: University of Chicago Press, 1951). 
6. Plan an experiment with guinea pigs to show the effects of differences in 
heredity under the same environmental conditions. Plan another to measure the 
effects of changed environment with heredity constant. How would you measure 
results? 


Current Procedures in Measuring Intelligence 
Verbal and Non-verbal Material 


The content of intelligence tests is generally classified as verbal or non- 
verbal. Verbal material requires the ability to read or, at the very least, 
to understand material spoken or read aloud by the examiner. Therefore, 
a strictly non-verbal examination is one that requires no use of language in 
its administration or in the pupils’ answers. Such a test was referred to 
earlier in the case of Army Beta, where it was stated that the directions for 
that examination could be given in pantomime. Tests are rarely non- 
verbal to this extreme; indeed we generally consider an intelligence test 
non-verbal as long as it does not require any reading ability. This type of 
test is useful and quite generally necessary in testing young children, illiter- 
ates, and the feebleminded. 

While non-verbal material may be pictorial, it may also be of the type 
known as "performance." Formboards, which are a kind of jigsaw puzzle, 
building blocks, and tracing mazes are typical of the performance type of exer- 
cises found in individual intelligence tests. Samples are shown in Figure 8. 

In group tests the non-verbal material is generally pictorial, as will be 
illustrated later in this chapter. 

While the correlation between verbal and non-verbal types of tests of in- 
telligence is far from perfect, it is generally high enough to justify the use 
of non-verbal material where verbal tests cannot be used. This is found 
most often in tests for young children. Non-verbal material of the perform- 
ance type is also used extensively in tests of aptitude. 


Individual Scales 


A brief description has already been given of the procedure for adminis- 
tering a Stanford-Binet examination. The method is essentially the same 
in others of the Binet type, such as the Kuhlmann, except that the latter 
takes both accuracy and speed of response into account. It will be recalled 
that one of the criticisms of the Stanford-Binet (1916) was lack of adequate 
“ceiling” or “top” for testing adults. Although this was remedied to a 
large extent in the 1937 Revision, there was still some dissatisfaction on 
this point. 

Figure 8 
Examples of Typical Performance Tests 

A. Porteus Maze Test. B. Knox-Kemp Feature Profile Test: 
Pintner-Patterson Modification. C. Minnesota Rate of Manipulation 
Test. D. Seguin-Goddard Formboard. 

(Reproduced by permission of C. H. Stoelting Company, publisher.) 


Another frequently voiced criticism of the Binet scale and its revisions 
was that the scales were designed for use with children, and the adult 
materials added later were really not essentially different. Consequently, it 
was said that the scales were not appropriate for adults and not intrinsically 
interesting to them. To meet these criticisms a scale for measuring adult 
intelligence was brought out by Wechsler in 1939.⁸ This was an individual 
intelligence scale known as the Wechsler-Bellevue Intelligence Scale. It 
departs in certain important respects from the Binet-type scale. The tests 
are not grouped by age levels and the scale yields standard score or 
deviation I.Q.’s. It was designed for and standardized largely on adults, but 
attained great popularity for use with children as well. In 1949 Wechsler 
published another scale for children along the same general pattern as the 
adult scale.⁹ More recently he has published a revised and improved Adult 
Intelligence Scale.¹⁰ These scales, together with the Revised Stanford-Binet, 
are by far the most widely used individual intelligence tests and are regarded 
as practically standard in all work where individual tests are used. 


8 David Wechsler, The Measurement of Adult Intelligence (Baltimore: Williams and 
Wilkins, 1939). 


Typical Group Tests of Intelligence 


There are many group intelligence tests available for use at all levels of 
human development from kindergarten to adult. Since it would be impos- 
sible to make a complete and up-to-date listing of such tests, to say nothing 
of describing them in detail, only a small number of illustrative ones will 
be discussed at length, as was done previously with achievement tests. 
Reference to catalogs of test publishers and to the various editions of The 
Mental Measurements Yearbooks will identify many other intelligence tests 
for those interested. The first one to be described is among the best-known 
and most widely used group tests. 


Kuhlmann-Anderson Intelligence Test 


1. Names of test and authors. Kuhlmann-Anderson Intelligence Test.¹¹ 
Fred Kuhlmann and Rose G. Anderson. 


2. Nature and purposes. A series of 39 tests in nine overlapping groups 
designed to measure intelligence. 


3. Grade level. Kindergarten to adult. 
4. Number of forms. One form. 


5. Publisher and date of publication. The first five editions were pub- 
lished by The Educational Test Bureau, 1927-1942. The Sixth Edition was 
published by the Personnel Press, Inc., 1952. 


6. Cost. $2.70 per 25. 


7. Content. The 39 separate sub-tests contain exercises of various types. 
The first 17 are non-verbal requiring no reading. For example, in Test 3 there 
are rows of pictures of familiar objects in each of which the child is told to find 
two things that are alike in some way, the nature of the resemblance or likeness 
being specified. Thus, in the first row (page 255) he is told to mark the two 
that are good to eat; in the second, the two to play with; in the third, two 
things to cook with, and so on. 

Test 18 and most of the remaining ones are verbal, requiring the reading of 
letters, words, or numbers. For example, Test 26 is as follows: In each row 
there is a key word followed by five or six other words. The task is to find 
among these five or six words two things which the first one is never without. 
A table, of course, is never without a top or legs; a tree is never without roots 
or branches; a book is never without pages or printing, and so on. 


Examples: 
   table      top     paint   legs    cloth     dishes 
   tree       shade   nuts    roots   leaves    branches 
1. book       story   pages   shelf   picture   printing 
2. squirrel   nuts    fur     tail    cage      tree 


9 David Wechsler, Wechsler Intelligence Scale for Children (New York: Psychological 
Corporation, 1949). 

10 David Wechsler, Wechsler Adult Intelligence Scale (New York: Psychological 
Corporation, 1955). 

11 Quotations in test description by permission of the Personnel Press, Inc., publisher. 


Each of the 39 parts or sub-tests is separately timed. The first ten tests 
are for kindergarten and the first half of Grade 1; Grade 1, second semester 
takes Tests 4 to 13, inclusive; Grade 2, Tests 8 to 17, inclusive, and so on. 
Each successive grade takes some of the same tests as the preceding and the 
following grades, yet no two grades take exactly the same tests. All of the 
nine levels except two take 10 tests each. This offers an advantage over sepa- 
rate tests for different grade levels in that there is a high degree of continuity 
and comparability from grade to grade. 


8. Time required. About 50 minutes. (At the lower levels an hour 
should be allowed, including time for instructions and breaks.) 


9. Directions for, and ease of, administering. Each sub-test has 
separate directions and timing. The tests are more complicated to administer 
than many present-day group tests, yet directions are clear and complete. 


10. Validity. The fundamental criterion in earlier editions of this test is 
chronological age. The tests are chosen on the basis of size and consistency 
of increase in scores for successive age levels. Correlations with school marks 
and similar evidence were rejected because these were said to be influenced by 
qualities other than intelligence. In the last edition validity is based on the 
ability of the tests to differentiate between various levels of mental 
development, which they are reported to do very well. 


11. Reliability. In earlier editions it was stated that the tests are reliable 
because (a) they are adjusted in difficulty to the level of mental development at 
the age for which each test is intended; (b) directions are adequate, thus 
reducing variation in scores due to faulty administration; and (c) the median 
mental-age method of scoring is used. In the last edition reliability (split-half) 
is about .90. The standard error of a score (I.Q.) is 5.5. 


12. Manual. Each set of ten tests for a given level is put up in a separate 
booklet and is accompanied by a separate manual for that booklet. In addition, 
there is a Master Manual containing copies of all 39 tests, background data, 
and directions for administering, scoring, and interpreting. 


13. Scoring. Answers are marked on test booklets; no separate answer 
sheets have been developed. Scoring is fairly simple, but ranges from objective 
to somewhat subjective, particularly on the non-verbal tests. Each test in a 
booklet is scored separately so that each child gets ten separate sub-test or 
part scores. These scores are converted to mental age equivalents which in 
turn are located on a table of mental age equivalents. The median (between 
the fifth and sixth scores in order of size) is the one on which the mental age 
score of the individual is determined. This method has the advantage of reduc- 
ing the effect of a very high or a very low score on the individual’s mental 
age score. The scores on each of the ten tests may also be shown on a profile 
graph. 


14. Norms. While the tests were being developed, they were tried out on 
as many as 30,000 subjects. Original norms were based on at least 350 school 
children at each age level. Beginning with the Fourth Edition some revisions 
were made in the norms as a result of testing about 5,000 school children in 
Pennsylvania and New Jersey. Norms are extremely simple to use, although the 
data on which they are based are not provided except as noted above. 


15. Format. The printing of the test booklets and manuals has been 
improved in the Sixth Edition. Except for this, they remain unchanged in format 
through the various editions. The booklets are small and on the whole do not 
equal the best tests now on the market. Nevertheless, these tests have been 
among the most widely used intelligence tests for a quarter of a century. 
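
The median method of scoring described under item 13 can be illustrated with a short sketch. The ten mental-age equivalents below are hypothetical values in months, chosen only to show how one extreme sub-test score affects the mean but not the median; they are not taken from the test manual, and the function name is an assumption made for this example.

```python
from statistics import median


def median_mental_age(subtest_mental_ages):
    """Return the median of a pupil's sub-test mental-age equivalents (months)."""
    if not subtest_mental_ages:
        raise ValueError("at least one sub-test mental age is required")
    return median(subtest_mental_ages)


# Hypothetical mental-age equivalents (in months) for one pupil's ten sub-tests.
# The last value is deliberately extreme.
scores = [96, 98, 100, 101, 102, 104, 105, 106, 108, 150]

# With ten scores the median lies midway between the fifth and sixth in order of size.
pupil_mental_age = median_mental_age(scores)

# The arithmetic mean, shown for comparison, is pulled upward by the extreme score.
mean_mental_age = sum(scores) / len(scores)

print(pupil_mental_age, mean_mental_age)
```

With these invented figures the median falls between 102 and 104, while the mean is raised several months by the single score of 150, which is the advantage the text attributes to the median method.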


The second group test of intelligence to be described is one that has also 
been very widely used in recent years. Although it differs markedly in its 
conception, plan, and general organization from the Kuhlmann-Anderson, 
it has substantial merit and is one of the leading tests in this field. 


California Test of Mental Maturity 


1. Names of test and authors. California Test of Mental Maturity.¹² 
Elizabeth T. Sullivan, Willis W. Clark, and Ernest W. Tiegs. 


12 Quotations in test description by permission of the publisher. 


2. Nature and purpose. It is stated in the test manual that this is a 
“diagnostic test of mental maturity. Its primary purpose is to make for each 
individual a diagnostic evaluation of those mental abilities which are related 
to, or determine, his success in various types of activities, in order that the 
teacher or employer may utilize this information directly in aiding him when 
he is experiencing learning difficulties.” Little factual evidence is presented in 
the manual or elsewhere in support of the diagnostic values claimed for the test. 


3. Grade level. Pre-primary, K-1; Primary, 1-3; Elementary, 4-8; Inter- 
mediate, 7-10; Advanced, 9-Adult. 


4. Number of forms. One. 


5. Publisher and date of publication. California Test Bureau, 1936- 
1951. 


6. Cost. Complete, $.14 per copy; short form, $.08 per copy, all levels, in 
packages of 35. Answer sheets for elementary, intermediate, and advanced, 
$.07 per copy, Scoreze; $.04 per copy, standard. 


7. Content. The test comes in two editions, complete and short form. 
The complete edition includes eleven tests at the Pre-primary and Primary 
levels, and twelve at the three upper levels. These are named (1) Immediate 
Recall, (2) Delayed Recall, (3) Sensing Right and Left, (4) Manipulation of 
Areas, (5) Opposites, (6) Similarities, (7) Analogies, (8) Inference, (9) Number 
Series (in three upper levels), (9) Number Concepts (in two lower levels), (10, 
11) Numerical Quantity (two at three upper levels, one only at Pre-primary 
and Primary), and (11 or 12) Verbal Concepts. 

The tests are said to measure the following: Tests 1 and 2, Memory; 3 and 4, 
Spatial Relationships; 5, 6, 7 and 8, Logical Reasoning; 9 and 10, or 9, 10 and 
11, Numerical Reasoning; and 11 or 12, Verbal Concepts. All of these are re- 
garded as separate mental factors. 

A further grouping of sub-tests is made to yield a Language and a 
Non-language score and I.Q. Tests labelled Delayed Recall, Inference, Numerical 
Quantity, and Verbal Concepts are called the Language Factors; the remaining 
sub-tests are called the Non-language Factors. The language tests are 
presented in language form; the non-language with a minimum use of language. 

The short form includes seven sub-tests at each level in the areas of Spatial 
Relations, Logical Reasoning, Numerical Reasoning, and Verbal Concepts. 
These also are grouped into Language and Non-language sections. 


8. Time required. The tests are said to be power tests rather than speed 
tests, but it is recommended that all time limits be observed. The prescribed 
limits are said to be ample for pupils to reach the practical limits of their 
abilities. Actual working times given in the manual for the complete form are these: 


Pre-primary     50 minutes 
Primary         69 minutes 
Elementary      84 minutes 
Intermediate    88 minutes 
Advanced        90 minutes 

For the short form working times range from 20 to 52 minutes. 


9. Directions for, and ease of, administering. Directions for 
administering are clear and complete. Language and Non-language sections may 
be given individually or together. There is a substantial amount of reading 
of directions aloud by the examiner, which, plus time for passing out and 
collecting papers, answering questions, etc., increases the actual times given 
above by approximately fifty per cent. The tests require some preparation 
and practice before they can be properly administered, although no unusual 
technical competence is required. 


10. Validity. Evidence of validity rests principally upon a comparison of 
this test with the Stanford-Binet. However, the manual does not specify which 
Stanford-Binet test is used as a comparison, or give any correlations of scores on 
the California Test with the Stanford-Binet. 


11. Reliability. Coefficients for sub-tests calculated by the split-half 
method and corrected by the Spearman-Brown Formula are quite high. They 
range from a low of .80 for Verbal Concepts in Grade 1 to .92 for Numerical 
Reasoning in Grades 7 to 9 and with 100 adults. Reliabilities for the total 
tests range from .93 to .95. Standard errors of measurement range from about 
3 to 5 points of I.Q. 


12. Manual. The manuals are complete; they contain a wealth of informa- 
tion about using and interpreting the tests. 


13. Scoring. This may be done in three ways. First, answers may be 
marked on test booklets and scored with printed strip keys. Second, Scoreze 
answer sheets may be used with the upper three levels. By separating the 
answer sheet and the carbon paper backing, scoring is done very simply by 
scanning the back of the answer sheet. Third, standard answer sheets may be 
used and scored manually with a stencil, or by machine. In every case the 
score is the number right. 

Three scores are obtainable for each individual. These are total score 
(Total Mental Factors), the Language Factors, and the Non-language Factors. These 
also yield three corresponding I.Q.'s. 

Scores on each sub-test, on the five mental factors, and on the total may be 
placed on a profile chart. 


14. Norms. Each level was standardized on a stratified sampling of 
25,000 cases. This also constituted the normative population. The norms are 
of three types: mental age, grade placement, and percentile. The percentile 
norms are for different ages and educational levels. The original norms have 
subsequently been checked against more than 100,000 additional cases. 


15. Format. The tests, manuals, and accessories of the 1951 Edition are 
among the most attractive tests on the market today. 
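
The reliability figures quoted under item 11 rest on two standard formulas of classical test theory: the Spearman-Brown correction of a split-half correlation, and the standard error of measurement. The sketch below is illustrative only. The half-test correlation of .80 and the standard deviation of 16 I.Q. points are assumed values, not figures from the manual, and the function names are chosen for this example.

```python
import math


def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation between two half-tests."""
    return 2 * r_half / (1 + r_half)


def standard_error_of_measurement(sd, reliability):
    """Classical standard error of measurement: SD * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1 - reliability)


# Assumed correlation between the odd-item and even-item halves of a sub-test.
r_half = 0.80
r_full = spearman_brown(r_half)

# Assumed I.Q. standard deviation of 16 and a total-test reliability of .94,
# the midpoint of the .93 to .95 range reported in the text.
sem = standard_error_of_measurement(16, 0.94)

print(round(r_full, 3), round(sem, 2))
```

Under these assumptions the corrected coefficient comes to about .89 and the standard error to a little under 4 points, which lies inside the range of 3 to 5 points reported above. A different standard deviation would, of course, give a different standard error.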


The third type of intelligence test to be described here is one that is 
unusual and yet representative of a number of attempts to construct a test that 
is relatively “culture fair.” Many persons have been concerned with the 
possibility that scores on existing intelligence tests are affected by the 
cultural and educational opportunities of the individual, as we have mentioned 
earlier in this chapter. The basic assumption in conventional tests is that 
the material or content used is novel to practically all children being tested, 
or that the children have had an equal opportunity to learn it. From a test 
construction standpoint, it is more desirable to use material which is novel 
to a majority of the children, yet it is most difficult to attain this objective in 
any absolute sense. It seems safe to say, therefore, that the best current 
intelligence tests have attempted to provide materials and content which 
most school children have had ample opportunity to learn, and most tests 
have succeeded rather well in this. Thus, the degree to which the child has 
learned that which is within the common experience of children of his age 
group can be taken as a reflection of his ability to learn, that is, his 
intelligence. 

In the work of Eells and his associates, cited earlier, it was found that 
some types of content in current intelligence tests seem more affected 
than others by differences in cultural and socio-economic factors. An 
attempt was made to identify types of material least affected in this way and 
to construct a test which would be little affected by such differences. How 
successful this venture has been is still not established beyond question, 
but preliminary data seem quite promising. 


Davis-Eells Test of General Intelligence or Problem Solving Ability 


1. Names of test and authors. Davis-Eells Test of General Intelligence or 
Problem Solving Ability.¹³ Allison Davis and Kenneth Eells. 


2. Nature and purposes. To construct a test of intelligence, free of 
reading demands, based on common experiences shared by all urban American 
children. The test is said to consist of realistic problems in the experience of 
all children and is entirely pictorial except for directions read aloud by the 
administrator of the test. 


3. Grade level. Primary, Grades 1 and 2; Elementary, Grades 3 through 6. 


4. Number of forms. One: Form A. 


5. Publisher and date of publication. World Book Company, 1953. 


6. Cost. Primary, $4.00 per 35; Elementary, $4.45 per 35. 


7. Content. One part is called “Best Ways” problems. In each problem, 
three pictures are presented. Each picture shows a person or group of persons 
starting to solve a problem in a different way. The task is to find the picture 
that shows the best way of solving the problem. Example: 


13 Quotations in test description by permission of the publisher. 

Another type of problem is called “Analogies.” This consists of pictorial 
situations like the example shown below. 


In each case the relationship between the first pair is made clear or suggested 
by the examiner before the child finds the answer. 

A third type of problem is one called “Probabilities.” In each case a picture 
shows a situation which is followed by three possible explanations. The task 
is to select the most likely explanation. 


No. 1: The man fell down and hit his head. 

No. 2: A ball came through the window and 
hit the man’s head. 

No. 3: The picture does not show how the 
man got the bump on his head. 
Nobody can tell because the pic- 
ture doesn't show how the man got 
the bump. 

Which number was true? 


These three types of problems are found both in the Primary and the Ele- 
mentary tests. In addition, the Elementary test contains some problems 
called Money Problems. In these the pupil is required to indicate how a given 
amount of change could be made from certain available coins. For example, 
in the problem below the task is to select from the three pictures the one in 
which the coins on the right could be obtained from the group of coins on the left. 


8. Time required. Primary, Grade 1: two 30-minute periods. Grade 2: 
three 30-minute periods. Elementary: two 60-minute periods. 


9. Directions for, and ease of, administering. Directions for admin- 
istering are very detailed and complete. Administration is comparatively com- 
plicated and time-consuming. Most of the time required to give the tests is 
consumed in reading directions and explanations. 


10. Validity. The content of the tests is said to be independent of reading 
skill, in-school instruction, or speed of response. The test purports to be a 
measure of “over-all capacity to solve mental problems,” and not a scholastic 
aptitude test. These problems are chosen as being of a kind which are encoun- 
tered by most children. Correlations of the Davis-Eells Test with Otis Quick- 
Scoring Mental Ability Tests range from .39 to .66, with a median of .52. Corre- 
lations with scores on standardized achievement tests in reading, arithmetic, 
language, and spelling are in the neighborhood of .40. 


11. Reliability. Split-half reliability coefficients corrected by use of the 
Spearman-Brown Formula average about .83 in Grades 2 through 6; for Grade 1 
it is .68. The standard error of measurement of a score ranges from 2.5 to 3.5. 
Test-retest coefficients with an interval of two weeks were approximately .70 in 
Grade 2, and .90 in Grade 4. 


12. Manual. The tests are accompanied by Directions for Administering 
which include directions for scoring, tables for converting raw score to an Index 
of Problem Solving Ability (I.Q.’s), and percentile equivalents. Information 
concerning development of the test, validity, reliability, and other statistical 
data are available in a separate manual. It would be useful to have at least the 
most important of such information in the booklet accompanying the test. 

13. Scoring. Scoring is quite simple. All items are of the three-response 
type, the pupil marking the number of the choice that seems best to him. Since 
there are only 47 items in the Primary form and 62 in the Elementary, and 
since these are widely spaced and in large print, it is easy to score the test. No 
separate answer sheets are provided, though they could be used, at least at the 
upper grade levels. The score is the number right. Printed scoring keys are 
provided. The raw score is converted to an IPSA (Index of Problem Solving 
Ability) by use of the tables previously referred to. Ages are used to the nearest 
half-year, which gives an approximate value for the IPSA. The IPSA is based 
on a normalized distribution with a mean of 100 and a standard deviation of 16. 
The authors state that this Index of Problem Solving Ability may also be called 
an I.Q. 

14. Norms. Means, medians, and standard deviations of raw scores are 
given for age groups by three-month intervals from 6-0 to 8-5 in Grade 1; from 
7-0 to 9-5 in Grade 2; from 8-0 to 13-11 in Grades 3 to 6. 


15. Format. The tests are well arranged and printed. Figures are large 
and spacing is generous. 
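
Item 13 states that the IPSA is based on a normalized distribution with a mean of 100 and a standard deviation of 16. The sketch below shows what such a scale implies: a pupil's percentile rank within his age group is converted to the corresponding point on a normal curve and then rescaled. It is an illustration of the general technique only. The published conversion tables are not reproduced here, and the function name and the percentile ranks used are assumptions made for this example.

```python
from statistics import NormalDist


def normalized_index(percentile_rank, mean=100.0, sd=16.0):
    """Convert a percentile rank (between 0 and 100, exclusive) to a normalized standard score."""
    if not 0 < percentile_rank < 100:
        raise ValueError("percentile rank must lie strictly between 0 and 100")
    # Find the normal deviate (z) below which this proportion of cases falls.
    z = NormalDist().inv_cdf(percentile_rank / 100)
    # Rescale so that the average pupil scores `mean` and one SD equals `sd` points.
    return round(mean + sd * z)


# Hypothetical percentile ranks for three pupils within their own age group.
for rank in (16, 50, 84.13):
    print(rank, normalized_index(rank))
```

A pupil at the 50th percentile of his age group receives an index of 100, and one at about the 84th percentile receives 116, one standard deviation above the mean.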


Other Intelligence Tests 


INDIVIDUAL 


1. Arthur Point Scale of Performance Tests. 1947. Ages 5-15. Knox Cube 
Test, Seguin Form Board (Arthur Revision), Arthur Stencil Design Test, Porteus 
Maze Test (Arthur Printing), Healy Picture Completion Test II. 


Revised Form II. Complete set, with manual and 100 score sheets, $66.00. 
60-90 minutes. 
Psychological Corporation. 


2. Leiter International Performance Scale. 1948. Ages 2-18. Fifty-four 
tests involving matching of colors and of objects, picture completion, spatial 
relations, footprint recognition, etc. 

One form, revised. $82.50, plus $3.00 per 100 record cards, plus $3.00 for 
manual. Carrying case, $8.00. 30-60 minutes. 

Psychological Service Center Press. 


3. Minnesota Preschool Scale. 1940. Ages 1.5 to 6, inclusive. Twenty-six 
tests: Verbal includes pointing to and naming objects; comprehension, naming 
colors, etc. Non-verbal includes copying figures, Knox Cube Test, paper-fold- 
ing, etc. 

Forms A and B. $18.00 per set. 10-30 minutes. 

Educational Test Bureau. 


4. Wechsler-Bellevue Intelligence Scale. 1939, 1946. For adults. Informa- 
tion, general comprehension, arithmetic reasoning, memory span for digits, 
similarities, vocabulary, picture arrangement, picture completion, block 
design, object assembly, digit symbol. 

Forms I and II. Form I, $17.50 per set; manual, $3.60. Form II, $19.00 per 
set; manual, $2.25. Record blanks, $1.60 per 25. About one hour. 

Psychological Corporation. 


5. Wechsler Intelligence Scale for Children. 1949. Ages 5-15. Same tests as 
in Wechsler-Bellevue with addition of optional maze test and coding in place of 
digit symbol. 

One form. $22.00 per set, including manual. Record forms, $2.00 per 25. 
Maze Test, $1.20 per 25. About one hour. 

Psychological Corporation. 


6. Wechsler Adult Intelligence Scale. 1955. Ages 16 to 75. Same tests as in 
Wechsler-Bellevue, Form I, revised and re-standardized. 

One form. $21.00 per set, including manual. Record forms, $1.60 per 25. 
About one hour. 

Psychological Corporation. 


GROUP 


1. Academic Aptitude Test — Verbal. 1944. Grades 7 to adult. Academic 
and general science; comprehension, judgment, arithmetic reasoning, logical 
selection, analogies, classification. 

One form. $2.75 per 25. 40 minutes. 

Acorn Publishing Company. 


2. Academic Aptitude Test — Non-verbal. 1944. Grades 7 to adult. Spatial 
relations, physical relations, graphic relations. 

One form. $2.75 per 25. 28 minutes. 

Acorn Publishing Company. 


3. American Council on Education Psychological Examination for College 
Freshmen. 1954. Linguistic and quantitative sections. 

Forms: 1947, 1948, 1949, 1952, 1954. $2.95 per 25. Answer sheets, $1.00 per 
25. 60 minutes. 

Educational Testing Service. 


4. American Council on Education Psychological Examination for High 
School Students. 1953. Linguistic and quantitative sections. 

Forms: 1946, 1947, 1948, 1953. $2.75 per 25. Answer sheets, $1.00 per 25. 
55 minutes. 

Educational Testing Service. 


5. Cooperative School and College Ability Tests. 1955. Grades 10-14 now 
available. Sentence completion, numerical computation, vocabulary, numeri- 
cal problem-solving. 

Forms 1A, 1B (1C, 1D restricted), college; Forms 2A, 2B, high school. Test 
booklets, $3.25 per 25; answer sheets, $1.25 per 25. 100 minutes. 

Educational Testing Service. 


6. Henmon-Nelson Tests of Mental Ability. 1950. Elementary, Grades 
3-8; High School, Grades 7-12; College. Synonyms, analogies, number se- 
quence, arithmetic, etc. 

Forms A, B, and C for elementary and high school; A and B for college. 
$2.10 per 35. 30 minutes. 

Houghton Mifflin Company. 


7. Lorge-Thorndike Intelligence Tests. 1954. Kindergarten through Grade 
12. Primary: oral vocabulary, cross out and pairing of pictures. Nonverbal: 
figure analogies, figure classification and number series. Verbal: word knowl- 
edge, sentence completion, verbal classification, verbal analogies, and arithmetic 
reasoning. 

Forms A and B; five levels, Consumable or Re-usable Editions. Primary, 
Levels 1 and 2, $3.00 per 35; all others, verbal or nonverbal, $2.40 per 35. 
Levels 1 and 2, untimed (about 20 minutes); levels 3 and 5, verbal, 34 min- 
utes; nonverbal, 27 minutes. 

Houghton Mifflin Company. 


8. Otis Quick-Scoring Mental Ability Tests. 1939. Alpha, Grades 1-4; Beta, 
Grades 4-9; Gamma, High School and College. Alpha consists of pictorial 
items, Beta and Gamma of verbal material. 

Forms: A-S of Alpha, Em and Fm of Beta and Gamma. Alpha, $2.50 per 35; 
Beta and Gamma, $2.60 per 35. Alpha, 25 minutes; Beta, 30 minutes. 


World Book Company. 


9. S.R.A. Primary Mental Abilities. 1948-50. Ages 5-7, 7-11, and 11-17. 
Verbal meaning, quantitative, space, perceptual speed, and motor for 5-7; 
verbal meaning, space, reasoning, perception, and number for 7-11; verbal 
meaning, space, reasoning, number, and word fluency for 11-17. 

One form for each level. Tests for Ages 5-7, $3.00 per 20; 7-11 and 11-17, 
$.49 per booklet. Tests for Ages 5-7, 60-80 minutes; 7-11, 60 minutes; 11-17, 
26 minutes. 

Science Research Associates. 


10. Terman-McNemar Test of Mental Ability. 1941. Grades 7-13. 
Information, synonyms, logical selection, classification, analogies, opposites, best 
answer. 

Forms C and D. $3.20 per 35. 40 minutes. 

World Book Company. 


Learning Exercises 


7. If you were responsible for choosing a group intelligence test for use with fifth 
and sixth grades, how would you proceed? Describe the steps involved and justify 
each one. 

8. Intelligence tests frequently include tests of vocabulary (same-opposites, 
word meaning, etc.) and of numerical problems. There are also achievement tests 
in the same subjects or areas. How do you account for this? 

9. Why is an individual examination like the Binet generally regarded as more 
accurate and dependable than a group test? 

10. Compare a group test of intelligence with one of those described above. 
What are their comparative merits? 


THE MEASUREMENT OF APTITUDE 


So far in this chapter we have considered the measurement of general 
capacity only. This has been referred to as mental ability, intelligence, and, 
at times, scholastic aptitude. However, the term aptitude is usually re- 
served for use in connection with capacity in particular fields such as art, 
music, clerical work, and mechanics. Moreover, aptitudes are gener- 
ally regarded as existing independent of training, though they undoubtedly 
are influenced by it. 

The chief purpose of aptitude tests is to predict, or, to put the matter in 
another way, to identify individuals who have the greatest potential for 
development along special lines or who are likely to profit most by special 
training. In validating aptitude tests, scores are frequently correlated with 
performance on the job or in the special field for which the tests were con- 
structed. Thus, a mechanical aptitude test may be given to a group of 
boys beginning shop training. Their scores on the test are subsequently 
checked at intervals against their success in such training, and perhaps ul- 
timately against placement, persistence, and success in mechanical pursuits. 
Or, a test of clerical aptitude may be given to a group of girls starting com- 
mercial training, and the results checked against success or failure in the 
course of training and on the job. 
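
The validation procedure just described amounts to computing a correlation coefficient between aptitude scores and a later criterion of success. A minimal sketch follows, using the ordinary product-moment (Pearson) correlation. The eight pairs of scores and shop ratings are invented for illustration and do not come from any of the tests discussed; the function name is likewise an assumption.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two cases")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable does not vary")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: aptitude scores at the start of shop training and the
# instructor's ratings of success (1 = poor, 5 = excellent) at the end of it.
aptitude_scores = [42, 55, 61, 48, 70, 66, 53, 59]
shop_ratings = [2, 3, 4, 3, 5, 4, 2, 4]

validity_coefficient = pearson_r(aptitude_scores, shop_ratings)
print(round(validity_coefficient, 2))
```

A coefficient near 1.00 would mean that the test ranks pupils almost exactly as the criterion does, and one near zero that the test has little predictive value. In practice the same computation would be repeated at intervals as further evidence of success on the job becomes available.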

Aptitude tests are often designed on the basis of job analysis. If a test 
is to screen applicants for a certain type of work, let us say that of an 
electrician, a detailed analysis of the job of an electrician may be made. On the 
basis of this analysis are determined the specific abilities and skills required 
of a successful electrician. Then a test is constructed to measure these 
skills and abilities objectively and accurately, and the value of the test as 
a prediction of success is determined as described above. Job analysis is 
one of the distinguishing features of the work of developing aptitude tests in 
special fields. In this respect it resembles factor analysis as applied to in- 
telligence tests; indeed, factor analysis has been used in the development of 
some aptitude tests, though the job analysis method is the one usually fol- 
lowed. 

Several kinds of aptitude tests will be discussed and illustrated briefly. 
In the first category are aptitude tests in rather specialized and limited 
types of work. These include tests of mechanical, clerical, musical, and 
artistic aptitudes. The second category consists of a group of aptitude bat- 
teries designed to predict success in a large variety of types of work. Tests 
in the latter group are often referred to as differential or factored aptitude 
tests. 

Since a considerable amount of work has been done in the development of 
aptitude tests in the areas of mechanics, clerical work, music, and art, a 
sample or prototype of each of these will be described briefly. 


Mechanical Aptitude 


Stenquist Mechanical Aptitude Tests 

One of the first mechanical aptitude tests was developed by J. L. 
Stenquist.¹⁴ The Stenquist test consists of three sub-tests or parts. The first 
is an assembly test consisting of ten devices, such as a latch or a bicycle 
bell, which are to be assembled by the student. In recent years this part 
of the test has been superseded by the Minnesota Assembly Test, listed 
below, but the original Stenquist version is still available from the C. H. 
Stoelting Company at $24.50 per set. The other two parts are paper-and- 
pencil tests. Test 1 requires the matching of objects in two sets of pictures, 
such as a brace in one and a bit in the other, or a hammer and an anvil. 
In Test 2, pictures show objects with missing parts. The parts are shown 
in the opposite pictures, and the task is to match the part with the correct 
object. For example, a tricycle whose seat is missing may be shown on 
one side. Test 2 also includes pictures of various mechanical devices such 
as pulleys, block and tackle, and more complex machines, and each picture 
is accompanied by questions regarding the functions of the parts and their 
relationships to each other. 


4 J. L. Stenquist, Measurement of Mechanical Ability (New York: Teachers College, 
Columbia University, 1923), and Stenquist Mechanical Aptitude Tests 
(Yonkers-on-Hudson, N.Y.: World Book Company, 1921). 
The Stenquist tests have been validated by correlating scores on the tests 
with the ranking of students on general mechanical aptitude in fifteen shop 
classes. The median value of the correlations was .67. In eight classes the 
correlations between scores on the assembly test and the paper-and-pencil 
tests combined, averaged almost the same. 
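
The validation procedure just described can be illustrated with a short computational sketch. It is an illustration only: the three classes, their scores, and the instructors' rankings are invented, and the rank-difference (Spearman) coefficient is assumed as the measure of agreement, since the text does not say which coefficient was used. A coefficient is computed for each class, and the median is then taken across classes.

```python
from statistics import median


def ranks(values):
    """Rank values from highest (rank 1) to lowest; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def rank_correlation(test_scores, instructor_ranks):
    """Correlate ranks derived from test scores with instructor rankings (1 = best)."""
    return pearson(ranks(test_scores), instructor_ranks)


# Hypothetical data: (test scores, instructor's ranking of the same pupils).
CLASSES = {
    "Class A": ([52, 47, 61, 38, 55], [3, 4, 1, 5, 2]),
    "Class B": ([40, 58, 49, 63, 35, 51], [4, 2, 5, 1, 6, 3]),
    "Class C": ([44, 59, 50, 37], [2, 3, 1, 4]),
}

coefficients = {name: rank_correlation(s, r) for name, (s, r) in CLASSES.items()}
median_r = median(coefficients.values())

for name, r in coefficients.items():
    print(f"{name}: r = {r:.2f}")
print(f"Median correlation across classes: {median_r:.2f}")
```

The median is used here, as in the study reported above, because it is less affected than the mean by one class in which the test and the instructor happen to disagree sharply.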

The Stenquist tests were a pioneering effort in the field. Later tests have 
followed similar patterns, though some have been entirely of the paper-and-pencil 
type, while others have concentrated on performance measures. The 
latter type of test has the disadvantage of all such tests, namely, that it 
cannot be easily adapted for administration to groups, and is, therefore, 
time-consuming and expensive. It is generally believed, however, that this 
type is more valid than its paper-and-pencil counterparts. 


Other Tests of Mechanical Aptitude 


1. Detroit Mechanical Aptitudes Examination. 1939. Grades 7-16. Tool 
recognition, motor speed, size discrimination, arithmetic fundamentals, dis- 
arranged pictures, tool information, direction and speed of pulley and belt 
movements, and digit-letter substitution. 

One form. $2.50 per 25. 30 minutes. 

Public School Publishing Company. 


2. MacQuarrie Test for Mechanical Ability. 1925. Grades 7-adult. Tracing, 
tapping, dotting, copying, location, blocks, and pursuit. 

One form. $.12 per copy. 20 minutes. 

California Test Bureau. 


3. Mechanical Aptitude Tests. 1943. Grades 9-16 and adults. Comprehension 
of mechanical tasks, use of tools and materials, matching tools and operations, 
and use of tools and materials (pictorial). 

One form. $2.50 per 25. 45 minutes. 

Acorn Publishing Company. 


4. Minnesota Assembly Test. 1930. Ages 11 and up. (Revision of the Stenquist 
Assembly Test discussed above.) Assembly of 33 common mechanical 
devices such as push-button, door bell, etc. (Abridged form, 20 devices.) 

Two forms: Complete and Abridged. $71.00 and $52.00. 60 minutes. 

C. H. Stoelting Company. 


5. Minnesota Paper Form Board, Revised. 1941. Grade 7 and up. Sixty-four 
items measuring ability to think in two-dimensional and three-dimensional 
space. 

Forms AA and BB. $2.00 per 25. 20 minutes. 

Psychological Corporation. 


6. Prognostic Test of Mechanical Abilities. 1946. Grades 7-12 and adults. 
Arithmetic computation, reading simple drawings, use of tools, accuracy in 
measuring and discerning spatial relationships. 

One form. $.12 per copy. 38 minutes. 

California Test Bureau. 
7. S.R.A. Mechanical Aptitudes. 1947. High school and adults. Tool 
usage, space visualization, and shop arithmetic. 

One form. $.49 per booklet. Answer pads, $2.00 per 20. 40 minutes. 

Science Research Associates. 


8. Test of Mechanical Comprehension. 1940-51. Grade 9 and up. Sixty 
pictures requiring judgment on mechanical and physical principles. 

Forms AA, AA-F (French), BB, CC, and W1 (for women). $4.00 per 25. 
30 minutes. 

Psychological Corporation. 


Clerical Aptitude 


The designation above refers to tests in the commercial and business 
field. There are aptitude tests for general office work, typewriting, book- 
keeping, and shorthand. Although extensive work has been done in the 
testing of these aptitudes, the tests so far developed have not been out- 
standingly successful as predictive measures, and they have not been widely 
used by teachers, employers, or counselors. 


Detroit Clerical Aptitudes Examination 


The Detroit Clerical Aptitudes Examination⁵ consists of eight parts. 
Part 1 is a test of rate and quality of handwriting, and involves the copying 
of a short selection. Part 2 consists of a comparison between two sets of 
numbers to determine whether they are the same or different, e.g., 273 and 
275. Part 3 contains arithmetic problems. Part 4 is a test of manual 
dexterity and requires the student to perform such tasks as drawing 
crosses in circles as rapidly as possible without touching the circles. Part 5 
tests miscellaneous knowledge related to office and business. Part 6 is a 
series of pictures, each presented in several sections or parts of the whole; 
the task is to indicate their proper order or sequence. Part 7 is a substitution 
test in which letters are numbered according to a key which is constantly 
changing. Part 8 is a test of alphabetization. 

Each part of the Detroit Clerical Aptitudes Examination is timed separately. 
The test is designed for use with pupils at intermediate and junior 
high school grade levels to identify those who will probably succeed in com- 
mercial courses in high school. The reliability, test-retest, is .85. The 
correlation between scores on the test and scholarship in bookkeeping is 
.563; between test scores and scholarship in shorthand, .366; between test 
scores and scholarship in typewriting, .317. The Detroit Clerical Aptitudes 
Examination is available in one form only at $1.95 for 25 copies. The 
handwriting scale used in Part 1 must be ordered separately at $.35 per 
copy. 

5 Harry J. Baker and Paul F. Voelker, Detroit Clerical Aptitudes Examination, Revised 
(Bloomington, Ill.: Public School Publishing Company, 1944). 
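
A conventional aid in reading validity coefficients such as those reported for the Detroit examination, though not one the text itself applies, is to square each coefficient. The square gives the proportion of variance in the criterion (here, scholarship in each commercial subject) that is associated with the test score. The sketch below uses only the three coefficients quoted above; the function name is chosen for illustration.

```python
# Validity coefficients quoted in the text for the Detroit Clerical
# Aptitudes Examination (test score vs. scholarship in each subject).
VALIDITY = {
    "bookkeeping": 0.563,
    "shorthand": 0.366,
    "typewriting": 0.317,
}


def variance_explained(r):
    """Proportion of criterion variance associated with the predictor (r squared)."""
    return r * r


for subject, r in VALIDITY.items():
    share = variance_explained(r)
    print(f"{subject}: r = {r:.3f}, r squared = {share:.3f}")
```

Read in this way, even the highest of the three coefficients accounts for roughly a third of the variance in bookkeeping marks, which is consistent with the earlier remark that clerical tests have not been outstandingly successful as predictive measures.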
Other Tests of Commercial and Business Aptitude 


1. Bennett Stenographic Aptitude Test. 1939. High school and college. Sub- 
stitution of numbers for symbols and symbols for numbers, and spelling. 


One form. $2.10 per 25. 25 minutes. 

Psychological Corporation. 

2. E.R.C. Stenographic Aptitude Test. 1944. High school and adult. Word 
discrimination, phonetic spelling, vocabulary, sentence dictation, and speed of 
writing. 

One form. $4.00 per 20. 45 minutes. 

Science Research Associates. 

3. General Clerical Test. 1950. High school and above. Clerical speed and 
accuracy, numerical ability, verbal facility. 

One form. $3.90 per 25. 53 minutes. 

Psychological Corporation. 


4. Minnesota Clerical Test. 1933-46. Grades 8-12 and adults. Speed and 
accuracy in checking numbers and names. 

One form. $1.70 per 25. 15 minutes. 

Psychological Corporation. 

5. Personnel Research Institute Clerical Battery. 1945-47. Adults. (1) clas- 
sification, (2) number comparison, (3) name comparison, (4) tabulation, (5) 
filing, (6) alphabetizing, (7) arithmetic reasoning, (8) spelling. 

Form A of 1-3; Forms A and B of 4-8. $2.00 per 25 copies of any test. Total 
time, including time for directions, 100 minutes. 

Personnel Research Institute. 


6. Short Employment Tests. 1951. Adults. Vocabulary, arithmetic computation, 
clerical skill. 

Forms 1, 2, 3, 4; 1 and 4 are restricted. $1.70 per 25 for each test. 15 minutes. 

Psychological Corporation. 


7. Turse Shorthand Aptitude Test. 1940. Grades 8 and above. Stroking, 
spelling, phonetic association, symbol transcription, word discrimination, dictation, 
and word sense. 

One form. $3.10 per 35. 40 minutes. 

World Book Company. 


Musical Aptitude 


Seashore Measures of Musical Talents 


Various tests of musical aptitude have been devised, though the total 
number of such tests is not large. Probably the best-known is the Seashore.⁶ 
This consists of tests of pitch, loudness, rhythm, time, timbre, and tonal 
memory, all on phonograph records. The tests are as follows: pitch — the 

6 Carl E. Seashore, et al., Seashore Measures of Musical Talents (New York: Psychological 
Corporation, 1919-1944). Available on three vinylite records at $13.00 per set. 
Answer sheets, $2.20 per 50. 
higher of two tones; loudness — the louder of two tones; rhythm — comparison 
of two rhythmic patterns; time — the longer of two sounds; timbre 
— comparison of tonal quality to decide whether two tones are the same or 
different; tonal memory — comparison of two short musical figures differing 
in one note, to indicate by number which note is different. The tests are 
said to be applicable from fifth grade up, but they seem to work best with 
older subjects. Reliabilities range from .62 to .88 with an average of about 
.80. The tests require about one hour to administer, and may be given to 
individuals or groups. They show little relationship to differences in amount 
of musical training at advanced levels, but when administered in combina- 
tion with an intelligence test the Seashore shows a definite relationship to 
success in advanced musical study. 

Many musical aptitude tests are largely paper-and-pencil tests which 
measure knowledge and understanding of musical terms and symbols, 
though a few include tests of some of the types of performance that are 
measured by the Seashore. 


Other Tests of Musical Aptitude 


1. Drake Musical Aptitude Tests. 1954. Grades 3 through college. Musical 
memory and musical rhythm. One 33⅓ LP record. 

Forms A and B. $6.95 per record and manual; answer pads, $2.75 per 20. 
25 minutes. 

Science Research Associates. 

2. Drake Test of Musical Talent; A Musical Memory Test. 1934-42. Ages 
8 and up. Memory for musical melodies. 

Forms A and B. $.50 per copy of examiner's booklet. 25 minutes. 

Public School Publishing Company. 

3. Kwalwasser-Dykema Music Tests. 1930. Grades 4-16 and adults. 
Tonal memory, quality discrimination, intensity discrimination, feeling for 
tonal movement, time discrimination, rhythm discrimination, pitch discrimina- 
tion, melodic taste, pitch imagery, and rhythm imagery. 

One form. Five phonograph records, $18.00; record blanks, $1.50 per 50. 
60 minutes. 

C. H. Stoelting Company. 

4. Musical Aptitude Test. 1950. Grades 4-10. Rhythm recognition, pitch 
recognition, melody recognition, pitch discrimination, advanced rhythm 
recognition. 

One form. $3.00 per examiner’s booklet; answer sheets, $.03 per copy. 
(Piano necessary.) 40 minutes. 

California Test Bureau. 


Art Aptitude 


Closely related in some respects to musical aptitude tests are measures of 
aptitude in the visual or graphic arts. Among these is a test by William H. 
Varnum,⁷ which, though not widely known, will be briefly described. 


Selective Art Aptitude Test 


The Varnum Selective Art Aptitude Test is a fairly complicated test which 
consists of seven parts judged by the author to be basic elements in aptitude 
for graphic arts. These are named Acuity of Vision (ability to draw accurately, 
free-hand), Color Memory, Tonal Relationships (matching of 
colors), Proportionate Relationships, Balance and Rhythm, Rapidity 
under Creative Stimulus, and Creative Imagining. The test is designed for 
use with subjects aged 14 and beyond. The time required is about 45 minutes. 

This test has been developed with care and imagination. Its reliability 
for groups 18 years and older is .83; for subjects aged 14 to 18 it is about .66. 
Both are split-half coefficients. Fifty-five high school and college students 
have furnished a test-retest coefficient of .877. Much work was done to 
validate the Varnum test by comparing scores of persons in different lines 
of work requiring more or less artistic talent, of students in art and non-art 
subjects, and of persons in different age groups. The data seem to indicate 
that the test does differentiate quite well between such groups. 
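
Because the reliabilities quoted for the Varnum test are split-half coefficients, a brief sketch of how such a coefficient is commonly obtained may be helpful. The procedure assumed here is the usual odd-even split followed by the Spearman-Brown correction; the text does not state which split was used or whether the reported figures were corrected, and the item scores below are invented.

```python
def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation between two halves."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores holds one list of item scores per examinee.

    Returns (correlation between half-test totals, Spearman-Brown estimate).
    """
    odd_totals = [sum(person[0::2]) for person in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(person[1::2]) for person in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd_totals, even_totals)
    return r_half, spearman_brown(r_half)


# Hypothetical data: five examinees, six items scored 1 (right) or 0 (wrong).
SCORES = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0],
]

r_half, r_full = split_half_reliability(SCORES)
print(f"Half-test correlation: {r_half:.2f}")
print(f"Spearman-Brown estimate for the full test: {r_full:.2f}")
```

The correction is needed because each half is only half as long as the test actually given, and a shorter test is ordinarily less reliable than a longer one of the same kind.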

The Varnum test is better in certain respects than most art aptitude 
tests. It is more comprehensive in scope and more adequately standardized 
and evaluated. Its chief drawback is the complicated procedure necessary 
for administering and scoring; this may account, at least in part, for the 
fact that the test is not widely known. 


Other Tests of Art Aptitude 


1. Graves Design Judgment Test. 1948. Grades 7-16 and adults. Ninety 
sets of two- and three-dimensional designs calling for discrimination on the 
basis of eight principles of art. 

One form. $1.50 per booklet; answer sheets, $1.90 per 50. 20-30 minutes. 

Psychological Corporation. 


2. Horn Art Aptitude Inventory. 1951. Grades 12-16 and adults. Outline 
drawings of simple objects, and creative composition. 

One form. $5.00 per 50. 50 minutes. 

C. H. Stoelting Company. 


3. Knauber Art Ability Test. 1932. Grades 7-16. Ability to draw and 
design, and ability to find flaws in drawings. 

One form. $7.50 per 50. 180 minutes. 

C. H. Stoelting Company. 


7 William H. Varnum, Selective Art Aptitude Test (Scranton, Pa.: International Textbook 
Company, 1946). One form of the test is available at $1.20 per copy; manuals, 
scoring keys, and answer booklets extra. 
4. Meier Art Judgment Test. 1940. Grades 7-12. Selecting the better of 
two pictures in 100 pairs. 

One form. $1.40 per copy. 45-60 minutes. 

Bureau of Educational Research and Service. 


Aptitude Batteries 


A fairly recent development in measuring aptitudes is the aptitude bat- 
tery, or what is sometimes called the differential or factored aptitude battery. 
Such batteries have grown out of the development of factor analysis dis- 
cussed earlier in this chapter. The attempt to isolate and identify specific 
factors in general intelligence has carried over into the measurement of par- 
ticular aptitudes. It was reasoned that test batteries could be devised which 
would measure a number of important abilities or traits that enter into 
many types of work or activities. Then if the scores on the parts of the 
battery could be related to specific occupations or groups of occupations, 
these part scores could be differentially weighted according to their im- 
portance in such occupations. 
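
The idea of differential weighting can be made concrete with a small sketch. Everything in it is hypothetical: the part names, the two occupations, the weights, and the pupil's standard scores are invented for illustration and are not taken from any published battery. The same set of part scores yields a different composite for each occupation because each occupation weights the parts differently.

```python
# Hypothetical weights expressing how much each part score is assumed to
# matter for each occupation; the weights for an occupation sum to 1.
OCCUPATION_WEIGHTS = {
    "draftsman": {"spatial": 0.5, "numerical": 0.3, "verbal": 0.1, "clerical": 0.1},
    "stenographer": {"spatial": 0.1, "numerical": 0.1, "verbal": 0.4, "clerical": 0.4},
}


def composite(part_scores, weights):
    """Weighted sum of part scores for one occupation."""
    return sum(weight * part_scores[part] for part, weight in weights.items())


def occupational_profile(part_scores):
    """Composite score for every occupation in the weight table."""
    return {
        occupation: composite(part_scores, weights)
        for occupation, weights in OCCUPATION_WEIGHTS.items()
    }


# Hypothetical pupil, with part scores expressed as standard scores.
pupil = {"verbal": 55, "numerical": 60, "spatial": 70, "clerical": 45}

profile = occupational_profile(pupil)
for occupation, score in profile.items():
    print(f"{occupation}: {score:.1f}")
```

In practice the weights would have to be derived from validation studies relating each part score to success in the occupation; arbitrary weights such as these serve only to show the arithmetic.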

In addition to the impact of factor analysis, the growing number of persons 
engaged in educational and vocational guidance or counseling has increased 
the interest in and the demand for such batteries. Obviously, no 
counselor could test his client with all aptitude tests to determine how to 
advise him. Therefore, a single battery or group of tests would be very 
useful. 

Broadly speaking, one might include in the category of aptitude batteries 
the California Mental Maturity Test and any others which yield part scores 
that can be demonstrated to have low intercorrelations. (Obviously, if the 
parts of a battery have high correlations with each other they are measuring 
the same thing to a substantial degree and thus have little “differential” 
value.) Not many test batteries meet this criterion very well. We are con- 
cerned here with batteries developed to meet the criteria as outlined above 
and those which are designed primarily for general counseling purposes. 
Several such batteries are now available, and a typical one will be described 
below. 


Differential Aptitude Tests 

The Differential Aptitude Tests⁸ were some of the first of this type. They 
consist of tests of Verbal Reasoning (analogies), Numerical Ability (arith- 
metic computation), Abstract Reasoning, Space Relations, Mechanical 
Reasoning, Clerical Speed and Accuracy, and Language Usage (spelling, 
grammar, punctuation). 


8 G. K. Bennett, H. G. Seashore, and A. G. Wesman, Differential Aptitude Tests (New 
York: Psychological Corporation, 1947). Sample items in test description reproduced by 
permission of the publisher. 
The Abstract Reasoning test consists of items like the following: 

[Figure: sample item with a row of Problem Figures followed by a row of Answer Figures] 


The task is to select from among the Answer Figures the one that should 
come next in the series of Problem Figures. 

Space Relations consists of items in which a pattern that can be folded 
into a figure is followed by five figures, one or more of which can be formed 
from the pattern. The task is to identify the figures which can be formed 
from the pattern at the left. 


[Figure: sample Mechanical Reasoning item. Which man has the heavier load? (If equal, mark C.)] 


Clerical Speed and Accuracy consists of sets of letter and number com- 
binations in pairs. In the first set one combination is underlined. The 
task is to find the same combination in the second set. 


[Figure: Test Items and Sample of Answer Sheet] 
The tests are designed for use in junior and senior high schools, but may 
also be used with young adults in employment counseling. They are sepa- 
rately administered and scored. The time required for the entire battery 
is about 33 hours. IBM answer sheets are used, of which two forms are 
available. Reliability coefficients (split-half) for the eight sub-tests are 
generally in the .80’s and low .90's except for Mechanical Reasoning when 
used with girls, where the average is about .70. 

Ideally, in a battery such as this the correlations between the sub-tests 
should be low. They range from .06 to .67 and the authors believe that 
these correlations indicate that what the separate tests measure is suffi- 
ciently different to warrant the inclusion of every test in the battery. By 
statistical analysis it is shown that the separate tests meet a rather high 
standard of differentiating power. 
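
The check on intercorrelations described above can be sketched as follows. The sub-test names and scores are invented and much too few for a real study; they show only the procedure of correlating every pair of sub-tests and noting the range. Squaring a coefficient gives the proportion of variance two sub-tests share, so that the extreme values quoted in the text, .06 and .67, correspond to less than 1 per cent and about 45 per cent of variance in common.

```python
from itertools import combinations


def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def intercorrelations(scores):
    """Correlation for every pair of sub-tests; scores maps name -> list of scores."""
    return {
        (first, second): pearson(scores[first], scores[second])
        for first, second in combinations(scores, 2)
    }


def shared_variance(r):
    """Proportion of variance two measures have in common."""
    return r * r


# Hypothetical scores of five pupils on three sub-tests.
SUBTEST_SCORES = {
    "verbal": [10, 12, 15, 18, 20],
    "clerical": [30, 25, 33, 28, 31],
    "space": [5, 9, 6, 12, 10],
}

table = intercorrelations(SUBTEST_SCORES)
for (first, second), r in table.items():
    print(f"{first} x {second}: r = {r:.2f}")
print(f"Range: {min(table.values()):.2f} to {max(table.values()):.2f}")

# Shared variance at the two extremes reported in the text.
print(f"r = .06 -> {shared_variance(0.06):.4f}; r = .67 -> {shared_variance(0.67):.3f}")
```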

A continuing program of validation of the test battery for predictive purposes 
is being conducted by the authors and publishers. The studies reported 
to date are concerned principally with correlations between the sub-tests 
of the battery and grades in a wide variety of courses, tests of intelligence, 
other aptitude batteries, and objective achievement tests. In general, 
these results show that the tests do exhibit a differential relationship to 
achievement in various subject matters and to mental ability levels. 
Further validation against vocational success is to be hoped for. 

The tests are accompanied by a detailed Interpretative Manual and a 
casebook, Counseling from Profiles, which includes a small number of in- 
dividual case histories. 


Other Aptitude Batteries 
1. Aptitude Tests for Occupations. 1951. Grades 9-13. Personal-social, 
mechanical, general sales, clerical routine, computational, and scientific 
aptitudes. 

One form. Complete set of six tests, $.41. $.06 to $.10 per copy in packages 
of 35. 107 minutes. 

California Test Bureau. 


2. Flanagan Aptitude Classification Tests. 1953. High school and adults. 
Inspection, coding, memory, precision, assembly, scales, coordination, judg- 
ment and comprehension, arithmetic, patterns, components, tables, mechanics, 
expression. 

One form. Each of the 14 tests, $2.55 per 20. Two half-day sessions, each 
with ten-minute break. Total time, including time for directions, 2 hours, 46 
minutes, and 2 hours, 42 minutes. 

Science Research Associates. 


3. General Aptitude Test Battery. 1947. Ages 16 and up. Intelligence, verbal, 
numerical, spatial, form perception, clerical perception, eye-hand coordination, 
motor speed, finger dexterity, manual dexterity. 

One form. $54.65 per complete set. 135 minutes. 

United States Employment Service, United States Department of Labor. 


4. Holzinger-Crowder Uni-Factor Tests. 1954. Grades 7-12. Verbal, 
spatial, numerical, and reasoning. 

Forms Am and Bm, $6.00 per 35; Answer Sheet 1, $2.25 per 35; Answer 
Sheet 2, $1.50 per 35. Total time, 90 minutes. 

World Book Company. 


5. Multiple Aptitude Tests. 1954. Grades 7-13. Word meaning, para- 
graph meaning, language usage, routine clerical facility, arithmetic reasoning, 
arithmetic computation, applied science and mechanics, spatial relations — 
two dimensions, and spatial relations — three dimensions. 

One form. $.70 per set of nine tests. Individual tests, $.07 to $.13 per copy 
in packages of 35. 177 minutes. 

California Test Bureau. 


Other Aptitude Tests 


In addition to the types already mentioned, a number of aptitude tests 
have been developed in other subjects or areas such as foreign languages, 
algebra, science, blueprint-reading, filing, and manual dexterity. On the 
professional level aptitude tests or batteries are currently used in the fields 
of education, dentistry, law, medicine, nursing, and engineering. Such 
tests are also finding wide use in a number of vocations. 

The use of aptitude or prognostic tests in these and other fields is a 
promising development which gained considerable impetus during and 
since World War II. When millions of men who had had little or no train- 
ing for a particular specialty were inducted into service they had to be 
assigned to training by classification officers, and the need for some reason- 
ably accurate, objective, and rapid method of making such assignments 
quickly became apparent. It is now obvious that valid and reliable prog- 
nostic tests may become the means of saving incalculable time, energy, and 
money for the millions of young people who are faced with decisions about 
their life’s work. 


Learning Exercises 


11. Compare and contrast tests of readiness, intelligence, and aptitude with re- 
gard to (a) purposes, (b) nature of content, and (c) methods of validation. 

12. Make a list of the uses of intelligence tests in the schools. Do the same for 
aptitude tests and compare the lists. What uses occur for both types? What uses 
occur for only one type? 

13. To whom should a pupil's I.Q. be made known? Where should the information 
be filed? Discuss this problem from the standpoint of the teacher, the pupil, 
and the parent. 
14. Assume that you are having a conference with a parent about his son or 
daughter, and you wish to convey to him that his child is doing as well as can be 
expected. How would you express this in meaningful but general terms? 


Annotated Bibliography 


1. Anastasi, Anne. Psychological Testing. New York: The Macmillan Company, 
1954. 682 pp. This volume consists of four parts. Part I deals with 
the principles of psychological testing, including basic theory and uses. Part II 
deals with general classification or general intelligence tests. Part III deals with 
aptitude tests and includes some discussion of achievement tests. Part IV deals 
with measurement of personality, including measures of interests and attitudes, 
and sociometric devices. 


2. Cronbach, Lee J. Essentials of Psychological Testing. New York: Harper 
and Brothers, 1949. 475 pp. Part I presents basic concepts, with emphasis on 
selection and use of mental tests. Part II deals chiefly with measures of intelligence, 
but also includes consideration of tests of special abilities and achievement. Part III 
treats personality testing, interest tests, observation techniques, and the use of 
tests in counseling. 


3. Freeman, Frank N. Mental Tests, Revised. Boston: Houghton Mifflin 
Company, 1939. 460 pp. One of the first books in the field of intelligence tests, 
this presents the history, theory, development, and applications of mental tests. 


4. Freeman, Frank S. Theory and Practice of Psychological Testing, Revised. 
New York: Henry Holt and Company, 1955. 609 pp. Major attention is given to 
the theory, problems, development, nature, and types of intelligence tests. There 
are also chapters on aptitude tests and personality tests, including discussion of projective 
methods and sociometric devices. 


5. Goodenough, Florence L. Mental Testing. New York: Rinehart Company, 
Inc., 1949. 609 pp. A thorough treatment of intelligence tests, including historical 
background, principles and methods, tests and scales, and applications. Also in- 
cludes discussion of tests of aptitudes, interests, attitudes, and personality. 


6. Greene, Edward B. Measurements of Human Behavior, Revised. New York: 
The Odyssey Press, 1952. Chapters 5, 6, 8, 9, 10. This book presents excellent brief 
discussions of intelligence and aptitude tests and scales. Chapter 5 deals with tests 
of early childhood; Chapter 6, with individual tests of ability; Chapter 8, with group 
tests of ability; Chapter 9, with mechanical and motor tests; and Chapter 10, with 
tests of special aptitudes. 


7. Greene, Harry A., Jorgensen, Albert N., and Gerberich, J. Raymond. Measurement 
and Evaluation in the Elementary School, Second Edition. New York: Longmans, 
Green and Company, 1953. (See also Measurement and Evaluation in the 
Secondary School.) Chapter 10. A brief discussion of intelligence testing in the 
schools with emphasis on basic concepts and interpretation of results. 


8. Jordan, A. M. Measurement in Education. New York: McGraw-Hill Book 
Company, Inc., 1953. Chapters 14, 15. A good introductory treatment of the development 
of intelligence tests and the use of group tests. Emphasis on the uses 
of intelligence tests. 


9. Murphy, Gardner. An Historical Introduction to Modern Psychology. New 
York: Harcourt, Brace and Company, 1929. Chapter 21. A basic text on the his- 
tory of modern psychology from the seventeenth century. Presents considerable 
background material on the development of modern intelligence tests and brief 
accounts of the work of Binet and other pioneers in this movement. 


10. Peterson, Joseph. Early Conceptions and Tests of Intelligence. Yonkers-on- 
Hudson, N.Y.: World Book Company, 1925. 295 pp. The definitive book on the 
history of modern intelligence tests up to the end of World War I. Especially valu- 
able reference on the development of Binet's scales and their successors. 


11. Remmers, H. H., and Gage, N. L. Educational Measurement and Evaluation, 
Revised. New York: Harper and Brothers, 1955. Chapters 8-10. These three 
chapters deal with the nature of mental abilities and the measurement of general 
mental abilities and of special abilities or aptitudes. 


12. Super, Donald E. Appraising Vocational Fitness. New York: Harper and 
Brothers, 1949. 727 pp. A thorough treatment of the use of tests in vocational 
counseling. Includes consideration of tests of intelligence, achievement, aptitude, 
interests, personality, and the use of such tests in predicting job success. Many 
tests are carefully studied, described, and appraised for their value in vocational 
counseling. 


13. Terman, Lewis M., and Merrill, Maud A. Measuring Intelligence. Boston: 
Houghton Mifflin Company, 1937. 461 pp. Describes in detail the work of de- 
veloping the Revised Stanford-Binet, Forms L and M. Also includes the complete 
scales with directions for administering and scoring them. 


14. Thorndike, Robert L., and Hagen, Elizabeth. Measurement and Evaluation 
in Psychology and Education. New York: John Wiley and Sons, Inc., 1955. Chap- 
ters 9 and 10. Chapter 9 deals with standardized tests of intelligence. It presents 
a brief but sound discussion of different approaches to the measurement of intelligence 
and a discussion of the significance of the results of such measures. Chapter 10 
consists of a discussion of measurement of special aptitudes with particular 
reference to job advisement and selection. Consideration is also given to aptitude 
measurement in music and art. 


15. Torgerson, Theodore L., and Adams, Georgia S. Measurement and Evaluation 
for the Elementary School Teacher. New York: The Dryden Press, 1954. Chapter 4. 
A brief discussion of some of the theoretical aspects of intelligence testing. 
There is little on the nature and use of intelligence tests. 


11 


The Measurement of Personality and 


Adjustment: Self-Report Techniques 


Among college and university students personality is often considered to 
be synonymous with popularity. The individual who knows a large number 
of people, who is present at all functions, and who is persuasive in his deal- 
ings with others is said to have a “wonderful personality.” While there is 
undoubtedly some slight basis for this point of view, it is quite inadequate as 
an acceptable definition of personality. In a deeper sense, personality is 
the most inclusive frame of reference in which an individual can be judged. 
It includes the sum of all his characteristics and his behavior — his intelli- 
gence, knowledge, attitudes, interests, and his responses to and interaction 
with his environment. Personality thus broadly conceived is the total of all 
of these qualities, together with the effects of the combination of them 
on what he thinks, feels, says, and does. 

If we can accept such a broad definition of personality, we may go a step 
further and suggest that personality has two aspects: inner and outer. The 
inner phase refers to the adjustment of the individual within himself. That 
is, does he have a realistic and satisfying self-concept? Is he confident, 
sure of his personal worth? Has he set suitable and challenging goals for 
himself, keeping in mind his own limitations as well as his strengths? Is 
he satisfied with his occupation? Of course there are many more factors 
which importantly affect one’s inner personality, but these are among the 
chief elements in this concept. 

The outer or interpersonal phase of personality concerns the individual’s 
relationships with other people. Is he a useful member of family and com- 
munity groups? Does he have the respect and affection of his associates? 
Is he capable of enjoying group activities? Such questions point up the 
social or interpersonal aspect of personality as it is conceived here. 

We cannot emphasize too strongly the importance of adjustment in the 
definition of personality. Indeed, the individual who is well adjusted is 
most likely to be happy and to have a personality which makes a favorable 
impression on others. Conversely, the poorly adjusted person, almost by 
definition, is unhappy, and consequently his relationships with others will 
tend to be strained and difficult. Thus, one's personality is not some super- 
ficial characteristic that may be briefly adopted; rather, it is a reflection of 
the person’s innermost self, and it influences and becomes a part of every- 
thing he does. 

There are two fundamental approaches to the measurement of personal- 
ity or adjustment. In the simplest terms, the first approach involves asking 
the individual himself what he thinks, feels, says, and does; the second in- 
volves finding out about the individual from others who have known him. 
The first may be called the self-report approach; the other, the observational 
approach. In the self-report technique the examiner asks questions or 
presents stimuli to which the individual being measured responds. From 
the replies obtained, the examiner can formulate some idea of the subject’s 
personality or certain aspects of it. In the second, or observational, tech- 
nique the examiner asks someone who knows the subject well to express 
opinions about the subject. Both the self-report and the observational 
methods are widely used. Self-report techniques may be distinguished from 
the observational by the fact that the former are more often tests or other 
devices which yield scores. Thus, personality inventories, tests of attitudes, 
and tests of interests are typical self-report instruments, whereas rating 
scales, anecdotal records, and sociograms are examples of the observational 
type of report. 

The advantages and disadvantages of self-report techniques will be dis- 
cussed in this chapter, and those of the observational techniques will be 
treated in Chapter 12. 


PERSONALITY INVENTORIES 


One of the best-known and most widely used types of self-report is the 
personality inventory. Literally hundreds of these have been produced 
and many have been published. In effect, they are a kind of questionnaire 
or check list to which the individual responds by indicating how he generally 
feels or acts. 


Bernreuter's Personality Inventory 


The Bernreuter Personality Inventory ¹ was one of the first personality 
inventories to be published. It consists of 125 questions to which the subject 
responds with a “yes,” “no,” or a question mark. A few questions 
from this inventory will show its nature. 


Do you often feel just miserable? 
Do you ever heckle or question a public speaker? 
Are you thrifty and careful about making loans? 
Do you make new friends easily? 
Do people ever come to you for advice? 
Do you worry over possible misfortunes? 
Are you considered to be critical of other people? 

1 Quotations reprinted from The Personality Inventory, by Robert G. Bernreuter, with 
the permission of the publishers, Stanford University Press. Copyright 1935 by the Board 
of Trustees of Leland Stanford Junior University. 


Six separate scoring keys are available for use with the Bernreuter. These 
keys correspond to the six aspects of personality measured by the inventory: 
Self-Sufficiency, Neurotic Tendency, Introversion-Extroversion, Domi- 
nance-Submission, Confidence, and Sociability. The individual responds 
only once to each question, and his responses are scored with each key 
separately, since the significance of a given response to a particular question 
may vary with each aspect of personality. All of these categories of scoring 
represent areas that are psychologically meaningful. 
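
The idea of scoring one set of responses with several keys can be sketched in a few lines of Python. This is only an illustration: the item numbers, the weights, and the names `SCORING_KEYS` and `score_all_scales` are assumptions made for the example and are not the published Bernreuter keys.

```python
# Hypothetical scoring keys: the weights below are invented for illustration
# and are NOT the published Bernreuter weights.
SCORING_KEYS = {
    "Neurotic Tendency": {
        1: {"yes": 3, "no": -3, "?": 0},
        2: {"yes": -1, "no": 1, "?": 0},
    },
    "Sociability": {
        1: {"yes": -1, "no": 1, "?": 0},
        2: {"yes": 2, "no": -2, "?": 0},
    },
}


def score_all_scales(responses, keys):
    """Score a single set of responses once with each key separately."""
    scores = {}
    for scale, weights in keys.items():
        # The same answer may carry a different weight on each scale.
        scores[scale] = sum(
            weights[item][answer]
            for item, answer in responses.items()
            if item in weights
        )
    return scores


# The subject answers each question only once.
responses = {1: "yes", 2: "no"}
scale_scores = score_all_scales(responses, SCORING_KEYS)
print(scale_scores)
```

The point of the sketch is that the subject's answer sheet never changes; only the key laid over it does.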

When the Bernreuter was first published it could be scored for only the 
first four scales mentioned above. Subsequently, it was found by the use of 
factor analysis that the essentials of these four scales could be expressed 
in the two which are called Confidence and Sociability. The correlations 
between some of the original four were high enough to suggest that they 
were measuring or expressing practically the same thing. The inventory 
has separate norms for males and females at high school, college, and adult 
levels. 


The Mooney Problem Check Lists 

The Mooney Problem Check Lists,² on the other hand, yield data on problems 
or difficulties in such categories as Health and Physical Development, 
Home and Family, Morals and Religion, Sex, Economic Security, School or 
Occupation, and Social and Recreational. Obviously, this represents a 
different approach or analysis from that of the Bernreuter. A number of 
other personality inventories yield scores that are indicative of adjustment 
in similar areas. 

The Mooney Problem Check Lists are available in different forms for the 
junior high school, high school, college, and adult levels. The areas covered 
are essentially the same in each form, although there are differences in the 
items at the various maturity levels. The person responding is asked to 
read the list of several hundred items, underline those problems which are 
troublesome to him, and indicate the two or three which are of real concern. 
The items are grouped according to the different areas. In the junior high 
form the areas are Health and Physical Development; School; Home and 
Family; Money, Work, and the Future; Boy and Girl Relations; Relations 
to People in General; and Self-Centered Concerns. In the high school and 
college forms these same areas are included with additional ones such as 
Morals and Religion, and Curriculum and Teaching Procedure. The check 
lists can be completed by most persons in fifty minutes or less. The following 
samples from the junior high school form show the nature of the arrangement: 


Health and            1. Often have headaches 
Physical Education    2. Don't get enough sleep 
                      3. Have trouble with my teeth 
                      4. Not as healthy as I should be 
                      5. Not getting outdoors enough 

School                6. Getting low grades in school 
                      7. Afraid of tests 
                      8. Being a grade behind in school 
                      9. Don't like to study 
                     10. Not interested in books 

Home and Family      11. Being an only child 
                     12. Not living with my parents 
                     13. Worried about someone in the family 
                     14. Parents working too hard 
                     15. Never having any fun with mother and dad 

2 Ross L. Mooney and Leonard V. Gordon, The Mooney Problem Check Lists (New 
York: The Psychological Corporation, 1950). Examples from the Check Lists used by 
permission of the publisher. 


In the junior high school form there are 210 items in seven areas; in the 
high school, college, and adult forms there are 330 items in eleven areas. In 
every area there are 30 items. 

The Mooney Problem Check Lists are not tests in the usual sense, and do 
not yield scores. However, they do provide a useful tool for teachers and 
especially for counselors in locating problem areas which may then be in- 
vestigated by the use of more refined techniques. The Mooney Lists can 
also be used as a basis for guidance and orientation programs, and as a 
foundation for increasing teacher understanding in the classroom. 
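
Although the check lists yield no scores, a counselor may still wish to see at a glance in which areas a pupil's marked problems fall. The following Python sketch tallies marked items by area, using the fifteen sample items shown earlier. The function name `tally_by_area`, the short area labels, and the pupil's markings are assumptions made for illustration only.

```python
from collections import Counter

# Area of each sample item, following the fifteen junior high items quoted above.
AREA_OF_ITEM = {}
for number in range(1, 6):
    AREA_OF_ITEM[number] = "Health"
for number in range(6, 11):
    AREA_OF_ITEM[number] = "School"
for number in range(11, 16):
    AREA_OF_ITEM[number] = "Home and Family"


def tally_by_area(underlined, most_concerning, area_of_item):
    """Count troublesome items and items of real concern in each area."""
    troublesome = Counter(area_of_item[n] for n in underlined)
    of_real_concern = Counter(area_of_item[n] for n in most_concerning)
    return troublesome, of_real_concern


# Hypothetical pupil: five items underlined, two singled out as of real concern.
troublesome, of_real_concern = tally_by_area(
    underlined=[2, 6, 7, 9, 13],
    most_concerning=[7, 13],
    area_of_item=AREA_OF_ITEM,
)
print(troublesome.most_common())
print(of_real_concern.most_common())
```

Such a tally is a locating device, not a score; it merely points to the area (here, School) that may deserve further investigation.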


Pintner's Aspects of Personality 


One other inventory of the self-report type is Pintner's Aspects of Personality.³ 
This is designed for use with subjects younger than those tested by 
either the Bernreuter or the Mooney, being suitable for Grades 4 to 9. 
The aim of this test is to reveal something significant about a child’s personal 
adjustment in the areas of Ascendancy-Submission, Introversion-Extroversion, 
and Emotional Stability. The items are modifications of 
those in leading adult inventories, stated in language appropriate to the 
fourth-grade level. 

3 Rudolf Pintner, et al., Aspects of Personality (Yonkers-on-Hudson, N.Y.: World Book 
Company, 1938). 

The child’s responses to the items of the inventory are on a same-different 
basis. Thus, the child indicates his agreement or disagreement with an 
item such as, “When some child tries to push into line ahead of me, I am 
not afraid to tell him to get back.”⁴ A low score on the Ascendancy-Submission 
part may indicate a shy, retiring-type child; children with a high score 
are likely to be domineering and bullying. A low score on Introversion- 
Extroversion may indicate a tendency on the part of the child to withdraw, 
dodge responsibility, and live in a world of fantasy. Children scoring low 
on Emotional Stability are likely to be flighty and easily upset. 

It is suggested that Aspects of Personality may be used as a screening device 
to identify children who need psychiatric advice, as an aid in educa- 
tional and vocational guidance, and as a guide for the psychologist or psy- 
chiatrist in studying and diagnosing cases of maladjustment. 


Several questions are perennially raised concerning the self-report type of 
personality inventory. One of these has to do with its validity and reliabil- 
ity. A second relates to the problem of the subject’s faking or simulating 
responses. A third question concerns the usability of such inventories in 
schools. How much use should be made of them? By whom should they be 
used? Full discussions of these questions are beyond the scope of this book, 
but a few statements will be made to provide some information on each 
point. 

First, concerning validity, a little reflection will make it clear that the 
usual methods of establishing validity cannot apply to personality inven- 
tories. It is not possible to establish validity of personality inventories 
by correlating the scores with age, grade, or I.Q. One criterion often used 
is a comparison of scores with teachers’ ratings of adjustment; another is a 
comparison of scores with case histories to see if those who make poor scores 
on the inventory show a history of maladjustment, problem behavior, and 
personality disorders. On the whole, such studies show the self-report in- 
ventory to have fairly satisfactory validity as a screening device. Although 
these instruments generally do not reveal minute differences in adjustment, 
they serve to identify most of the serious cases of maladjustment. 

As to reliability, data vary widely. Some inventories report reliability 
coefficients as high as .90. Some, as in the case of the Mooney, do not yield 
scores and thus do not lend themselves to statistical analysis. In general, 
the reliability of personality inventories is lower than that of good standardized 
tests of intelligence or achievement. The reliability coefficients range 
from the .60’s to the .80’s, with the average falling somewhere between .70 
and .80. In this connection, it must be kept in mind that it is not the subject’s 
response to one or even a few items that is meaningful, but rather, 
the trend of his responses to a large number of items. For example, in the 
Bernreuter there are 125 items; in the Mooney, more than 200; and even in 
the Aspects of Personality there are 105. 

4 Ibid. Quotation used by permission of the World Book Company, publisher. 

On the problem of faked or simulated responses there is some evidence 
from research.⁵ The information obtained from some of these studies may 
be summed up briefly as follows: 

1. It is possible to fake or slant answers on a self-report inventory. 
2. It is not as easy to do this as might be supposed, especially to 
take and maintain consistently a pose. One's ability to slant answers 
is affected, among other things, by 
a. The sophistication of the subjects 
b. The subtlety of the statements on the inventory 
c. The number of items 
3. Some individuals cannot consistently fake answers in a desired 
direction even when told to do so. 

It is obvious that a personality inventory differs from a test of achieve- 
ment in this respect. In the latter case there is little chance for faking or 
bluffing if it is a good test, while in the former, the value of the responses 
depends on the subject’s willingness to be truthful and his ability to give 
accurate representations of his behavior and feelings. Unless he is cooper- 
ative, the responses are of no value. 

As to the question of whether personality inventories should be used in 
schools, it may be said, first of all, that they should be used very conserva- 
tively. Whereas the average classroom teacher can handle standardized 
tests of achievement and intelligence with some help and guidance from the 
counselor or school psychologist, this is seldom the case with personality 
tests. The use and interpretation of these require more training and 
experience than most teachers possess. As a rule, the personality test 
should be given on an individual basis rather than on a school-wide basis. 
When a child asks his teacher or counselor for help, or shows symptoms of 
maladjustment, the use of a personality inventory certainly may be indi- 
cated. It is important that the results be held in strictest confidence and 
used only with caution by a person qualified to interpret them. Even a 
tool like the Mooney Check Lists should probably be used selectively rather 
than on a school-wide basis, and the results should be made available only 
to guidance workers or school psychologists. When these principles are not 
adhered to, the results will often prove detrimental to pupil morale as well 
as to the parents’ confidence in the method used. This, in turn, may decrease 
the value of the inventory as a whole. 

5 See, for example, Victor H. Noll, “Simulation by College Students of a Prescribed 
Pattern on a Personality Scale,” Educational and Psychological Measurement, 11:478-88 
(Spring, 1951). 


Other Personality Inventories 


1. Adjustment Inventory. 1934-38. Grades 7-16 and adults. Home, 
health, social, emotional, and occupational adjustment. 

Student form and adult form. Student form, $1.00 per 10; adult form, $1.20 
per 10. 30 minutes. 

Stanford University Press. 


2. A-S Reaction Study. A scale for measuring Ascendance-Submission in per- 
sonality. 1928. College students and adults. Measures the tendency to domi- 
nate or be dominated in face-to-face relationships. 

Form for men and form for women, $3.80 per 35. 20 minutes. 

Houghton Mifflin Company. 


3. California Test of Personality. 1953. Primary, K-3; Elementary, 4-8; 
Intermediate, 7-10; Secondary, 9-College; Adults. Self-adjustment and social 
adjustment. 

Forms AA and BB. $.09 per copy. 50 minutes. 

California Test Bureau. 


4. Detroit Adjustment Inventory. 1942-53. Grades 3-6. Adjustment in 
four areas: self, home, school, and community. 

Form Gamma. $3.10 per 25. 30 minutes. 

Public School Publishing Company. 


5. Guilford-Zimmerman Temperament Survey. 1949. Grades 9-16 and 
adults. General activity, restraint, ascendance, sociability, emotional stabil- 
ity, confidence, personal relations, home satisfaction. 

One form. $3.75 per 25. Answer sheets, $.75 per 25. 45 minutes. 


Sheridan Supply Company. 


6. Minnesota Multiphasic Personality Inventory. 1951. Age 16 and up. 
Depression, hysteria, psychopathic deviate, masculinity and femininity, para- 
noia, psychasthenia, schizophrenia, hypomania, and social introversion. 

Individual form and group form. Individual, $13.50 per set, plus $1.50 per 
manual; scoring keys and manual, $8.50. Group, $5.50 per 25, plus $1.50 per 
manual. Answer sheets, $3.60 per 50. Scoring Keys and manual, $4.50. 60- 
90 minutes. 

Psychological Corporation. 


7. S.R.A. Junior Inventory. 1951-55. Grades 4-8. Getting along with 
others, my home and family, my health, about myself, my school, and things 
in general. 

Forms A and S, $2.00 per 20. 40 minutes. 

Science Research Associates. 


8. S.R.A. Youth Inventory. 1949. Grades 7-12. My school, looking ahead, 
about myself, getting along with others, my home and family, boy meets girl, 
health, things in general, and basic difficulties. 

One form. $.49 each booklet; self-scoring answer sheets, $2.00 per 20. 45 
minutes. 

Science Research Associates. 

9. Study of Values. 1951. College and adult. Aims to measure the rela- 
tive prominence of six basic interests or motives in personality: the theoreti- 
cal, economic, aesthetic, social, political, and religious. 

Form for men and form for women, $3.80 per 35. 

Houghton Mifflin Company. 


PROJECTIVE TECHNIQUES 


Another type of self-report is the projective test, which is a clinical in- 
strument to be used only by psychiatrists or clinical psychologists. It de- 
rives its name from the fact that in his responses the subject “projects” his 
feelings, emotions, conflicts, and problems. The projective tests are less 
structured than the personality inventories in that the questions, items, or 
stimuli are less definite and specific, and the subject is far more free to make 
responses in his own words. 

The Rorschach Ink-Blot Test is a well-known example of a projective test. 
It consists of a series of what purport to be inkblots, some black, some in 
color. These are shown to the subject one at a time, and he is asked to tell 
what they suggest or remind him of. From the subject’s responses the 
psychiatrist can determine much about the presence and nature of deep- 
seated emotional conflicts and maladjustments which the subject himself 
may not understand or even be conscious of. Obviously, such a test re- 
quires much training and experience to administer and interpret. 

A simpler form of projective technique, and one probably antedating 
such instruments as the Rorschach, is the word-association test. The pro- 
cedure in this type of test is to present a list of words one at a time to the 
subject, asking him to give the first word that comes to mind in each case. 
Some of the words are “loaded,” that is, they may carry emotional aspects 
for some individuals under certain circumstances. To be more specific, 
suppose a teacher or psychologist presented the following list of words to 
fifth-grade children, and asked each pupil to tell or write the first word that 
came to mind in each instance. Let us assume that the list of words has 
been used in this way with many children, and that certain conventional or 
non-emotional responses have been identified. The results might reveal that 
two pupils responded as follows: 

STIMULUS WORD     PUPIL A     PUPIL B 
House             big         white 
Apple             eat         sour 
Paper             write       burn 
Teacher           lady        ugly 
Sky               blue        high 



On the basis of the previous trials of these same words the responses of 
Pupil A might be regarded as quite conventional and normal. In the case of 
Pupil B, however, certain responses are found to be different in character. 
To apple, he responds “sour”; to paper, “burn”; and to teacher, “ugly.” 
The psychiatrist might interpret these responses as reflecting some factors 
in Pupil B’s emotional makeup that would warrant further investigation. 
Not only the response itself, but other considerations as well, have signifi- 
cance. For example, long hesitation before responding to a given word 
may be a sign of emotional blocking. Such a word list is a simple projective 
device in that the subject often “projects” his complexes or problems in the 
responses he makes. 
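
The screening logic just described, comparing each reply with previously identified conventional responses and noting long hesitations, can be sketched in Python. Everything specific here is assumed for illustration: the table of conventional responses, the hesitation times, the three-second limit, and the name `flag_responses`. Such a sketch only marks replies for a trained examiner to consider; it interprets nothing.

```python
# Hypothetical norms: replies assumed to have been found conventional in
# earlier trials of the same word list. These sets are invented.
CONVENTIONAL = {
    "House": {"big", "home", "white"},
    "Apple": {"eat", "red"},
    "Paper": {"write", "pencil"},
    "Teacher": {"lady", "school"},
    "Sky": {"blue", "high"},
}


def flag_responses(responses, conventional, hesitation_limit=3.0):
    """Return the stimulus words whose reply is unconventional or slow.

    responses maps each stimulus word to (reply, seconds before replying).
    """
    flagged = []
    for word, (reply, seconds) in responses.items():
        unusual = reply.lower() not in conventional.get(word, set())
        slow = seconds > hesitation_limit
        if unusual or slow:
            flagged.append(word)
    return flagged


# Replies from the table above; the hesitation times are invented.
pupil_a = {"House": ("big", 1.0), "Apple": ("eat", 1.2), "Paper": ("write", 0.9),
           "Teacher": ("lady", 1.1), "Sky": ("blue", 1.0)}
pupil_b = {"House": ("white", 1.0), "Apple": ("sour", 1.4), "Paper": ("burn", 2.0),
           "Teacher": ("ugly", 2.5), "Sky": ("high", 1.1)}

flags_a = flag_responses(pupil_a, CONVENTIONAL)
flags_b = flag_responses(pupil_b, CONVENTIONAL)
print(flags_a, flags_b)
```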

Some projective tests consist of pictures of people, and still others use 
objects such as toys or simple mechanical devices. There are many pro- 
jective tests and devices and they are especially popular with European 
psychologists and psychiatrists. 

Before closing this brief discussion, it may be appropriate to mention one 
rather well-known application of the association technique, i.e., the poly- 
graph or lie detector. In this technique, the person suspected of a crime is 
asked if he will submit to such a test. Usually he agrees to do so. If he is 
innocent he has nothing to fear or lose by taking it; if guilty, he is afraid that 
refusal will reflect on him adversely, so he generally agrees to take it with 
the hope of “beating it.” The test depends upon the known effect of strong 
emotion on blood pressure, pulse rate, and amount of palmar sweating, all 
of which are increased by heightened emotion. Normal rates of each are 
established for the subject under simple and innocuous questioning, then 
loaded questions are introduced. If the pulse, blood pressure, and palmar 
sweating increase, it is judged that the subject is not telling the truth. 
This, of course, is not accepted as proof of guilt, but the results, when shown 
to a guilty suspect, often bring about a confession. In general, specialists 
in crime detection are of the opinion that the results of lie detector tests are 
quite reliable, and that few persons are successful in “beating the test.” 
The harder the individual tries and the more determined he is to beat the 
test, the more pronounced are the tell-tale signs when a loaded question or 
some significant piece of evidence is suddenly introduced. 

For those interested in the further study of projective tests and procedures, 
certain references in the bibliography at the end of this chapter 
should prove helpful. 


Learning Exercises 


1. Define (a) personality, (b) adjustment. 

2. How might the results of personality tests be used by (a) classroom teachers, 
(b) guidance workers, and (c) school psychologists? 

3. Is there any place for projective tests in the school? If so, under what cir- 
cumstances? 

4. Graphology, the attempt to study personality through handwriting, is much 
favored in some European countries. How would you set up a scientific experiment 
to test the accuracy of analyses of handwriting of different individuals? 


INTEREST INVENTORIES 


Though interest inventories are not measures of personality and adjust- 
ment in a technical sense, it seems appropriate to consider such inventories 
at this point, since a person’s interests reflect his personality and are a part 
of it; moreover, his interests in relation to his abilities, opportunities, and 
background may have a definite bearing on his adjustment. Interest in- 
ventories are useful tools for counselors and school psychologists in helping 
the individual to make appropriate educational and occupational choices; 
inappropriate or unsuitable choices often lead to maladjustment and serious 
loss of time and energy. 

The interest inventory is based on the theory that a dependable picture 
of a person’s interest pattern can be obtained by asking him to express likes 
and dislikes of a large number of diverse activities and things. It is as- 
sumed, furthermore, that persons successful in the same occupation or field 
of work will have patterns of interests that are similar. Thus, a successful 
motion picture actor will have patterns of interests that are similar to the 
patterns of other successful actors. Finally, it is assumed that the patterns 
of interests of persons successfully engaged in one occupation — teaching, 
for example — will differ from those of persons in another field — engineer- 
ing or chemistry. These three assumptions are at the root of the develop- 
ment of interest inventories. Two typical inventories of this type will be 
briefly described. The total number of such inventories is not large, as 
compared with the number of tests in other types of personality measure- 
ment, and this is probably because the demand for the inventories is com- 
paratively small and also because they are all quite similar in content and 
organization. 


Strong Vocational Interest Blanks 


One of the earliest and best-known interest inventories is by Strong. It 
consists of lists of such things as Occupations, School Subjects, Amusements, 
Activities, and Peculiarities of People. In each group of items the individual 
is asked to express a preference, either in terms of “Like,” “Indifferent,” or 
“Dislike,” or an order of preference for the items. For example, in Part I 
— Occupations:⁶ 


1. Actor             L   I   D 
2. Advertiser        L   I   D 
3. Architect         L   I   D 
4. Army Officer      L   I   D 
5. Artist            L   I   D 


The subject encircles L if he would like or be interested in that occupation, 
D if he would dislike it, and I if he is indifferent toward it. 
Again, in Part VI — Order of Preference of Activities: 


Indicate by checking in Column 1 the three activities you would 
enjoy most; in Column 3 the three you would enjoy least; and the 
remaining four in Column 2. 


       1    2    3 
311.  ( )  ( )  ( )  President of a Society or Club 
312.  ( )  ( )  ( )  Secretary of a Society or Club 
313.  ( )  ( )  ( )  Treasurer of a Society or Club 
314.  ( )  ( )  ( )  Member of a Society or Club 
315.  ( )  ( )  ( )  Chairman, Arrangements Committee 
316.  ( )  ( )  ( )  Chairman, Educational Committee 
317.  ( )  ( )  ( )  Chairman, Entertainment Committee 
318.  ( )  ( )  ( )  Chairman, Membership Committee 
319.  ( )  ( )  ( )  Chairman, Program Committee 
320.  ( )  ( )  ( )  Chairman, Publicity Committee 


There are separate blanks for men and women. The categories in the 
blank for women are the same as those for men, but of course the activities, 
occupations, etc., are those appropriate for women. Each blank contains 
400 items to be checked. On the basis of the responses a profile is con- 
structed for each individual, showing the resemblances and differences be- 
tween the examinee's pattern of preferences and patterns of successful 
people in particular occupations. The blank for men can be scored for 41 
occupations; that for women, for 25. The scoring is rather complicated 
and laborious when done manually. It is best done by use of prepared 
answer sheets and the test-scoring machine. 

6 Sample items reprinted from Vocational Interest Blank for Men and Vocational Interest 
Blank for Women, by Edward K. Strong, Jr., with the permission of the publishers, 
Stanford University Press. Copyright 1938 and 1946 by the Board of Trustees of Leland 
Stanford Junior University. 

The Strong Blanks are probably the most carefully constructed and vali- 
dated instruments of this type. In addition to the occupational scales, 
scoring keys have been worked out for six occupational groups such as 
scientific or linguistic, and for certain non-occupational interests. The in- 
dividuals on which scoring scales for the various criterion groups are based 
have been carefully selected from among persons successfully engaged in 
their present respective occupations for at least three years. They number 
in most instances two hundred cases or more. The reliabilities of the sepa- 
rate scales for men are nearly all above .80 and a substantial proportion are 
above .90. The reliabilities of the scoring scales for the women’s blank aver- 
age .86. All were calculated by the split-half method. 
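
For readers unfamiliar with the split-half method, a minimal Python sketch follows. It assumes the common procedure of correlating scores on the odd-numbered items with scores on the even-numbered items and then applying the Spearman-Brown correction to estimate the reliability of the full-length scale; whether the Strong manuals followed exactly this procedure is not stated here. The data and the function names are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half_reliability(item_scores):
    """Odd-even split-half reliability, stepped up by Spearman-Brown.

    item_scores holds one row per person and one column per item.
    """
    odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd_half, even_half)
    # Spearman-Brown: estimated reliability of the test at full length.
    return 2 * r_half / (1 + r_half)


# Invented data: five persons, four items scored 1 or 0.
item_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = split_half_reliability(item_scores)
print(round(reliability, 2))
```

With these invented data the half-test correlation is 11/14, or about .79, and the corrected estimate is .88.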

Validity is determined by the fact that men or women entering a par- 
ticular occupation make higher scores on the scale for that occupation than 
on any other; that men or women continuing in an occupation suggested 
by the Strong Blanks make higher scores than men or women entering an 
occupation other than that suggested; that men or women continuing in a 
suggested occupation make higher scores than those who change from that 
occupation to another one; and that a person changing from some other oc- 
cupation to occupation X ten years later made higher scores as a college 
senior on the scale for occupation X or one other occupation than he did on 
the scales for eighteen other occupations.⁷ While such data do not show that 
scores on the blank predict success in an occupation, they do indicate that 
scores are related to occupational preference as judged by entrance into and 
persistence in an occupation. 


Kuder Preference Records — Vocational and Personal 


Another well-known and widely used set of interest inventories is the 
Kuder Preference Records — Vocational C, Vocational B, and Personal A.⁸ 
Vocational C measures ten broad areas of educational interest: outdoor, 
mechanical, computational, scientific, persuasive, artistic, literary, musical, 
social service, and clerical. Vocational B measures all these except outdoor. 
The Personal A measures five different kinds of personal preferences referred 
to as sociable, practical, theoretical, agreeable, and dominant. These inventories 
call for choices among a wide range of activities as does the Strong, 
but the choices are not grouped into categories such as occupations, subjects, 
etc. The alternatives are presented in groups of three; in each group the 
subject selects the one alternative which he would most prefer and the one 
which he would least like. For example: 


Exercise in a gymnasium 
Go fishing 
Play baseball 


Cook for a hotel 
Cook for people on camping trips 
Cook for a family 

7 Edward K. Strong, Jr., Manual for Vocational Interest Blank for Women (Stanford 
University, Calif.: Stanford University Press, 1947), p. 14. 

8 G. Frederic Kuder, Kuder Preference Records — Vocational and Personal (Chicago: 
Science Research Associates, 1948). Sample items reproduced by permission of the publisher. 


The activities in each group are not necessarily in the same category, 
though in most groups they are. The blanks are scored for preferences in 
terms of the areas or fields as listed above; they are not scored for specific 
occupations, though the manual states that work toward this end is in 
progress. 

The scoring of the Kuder Preference Records is very simple. By means of 
a special type of answer booklet the scores of each individual on the group or 
area scales can be readily determined. These scores can then be transmuted 
into a profile. 
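
The general idea of scoring forced-choice groups by area, rather than by specific occupation, may be sketched as follows. This is a simplified illustration and not the published Kuder scoring procedure: the assignment of activities to areas in `AREA_OF_ACTIVITY` is invented, activities without an assigned area are simply ignored, and the function name `tally_triads` is hypothetical.

```python
from collections import Counter

# Invented assignment of a few activities to interest areas; the actual
# scoring keys are not reproduced here.
AREA_OF_ACTIVITY = {
    "Go fishing": "outdoor",
    "Cook for people on camping trips": "outdoor",
    "Cook for a family": "social service",
}


def tally_triads(choices, area_of_activity):
    """Tally 'most preferred' and 'least liked' choices by interest area.

    choices holds one (most_preferred, least_liked) pair per group of three.
    """
    most, least = Counter(), Counter()
    for most_preferred, least_liked in choices:
        if most_preferred in area_of_activity:
            most[area_of_activity[most_preferred]] += 1
        if least_liked in area_of_activity:
            least[area_of_activity[least_liked]] += 1
    return most, least


# Hypothetical subject responding to the two sample groups shown earlier.
choices = [
    ("Go fishing", "Exercise in a gymnasium"),
    ("Cook for people on camping trips", "Cook for a family"),
]
most, least = tally_triads(choices, AREA_OF_ACTIVITY)
print(dict(most), dict(least))
```

A profile is then simply the set of area tallies laid side by side, so that the examiner can see where the subject's preferences concentrate.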

A large amount of research bearing on the validity of the Kuder Records, 
particularly Vocational C, has been published. The general trend of such 
studies has been to show marked differences in scores on the separate scales 
for different occupational groups and for different college majors and cur- 
ricula, and definite relationships to job satisfaction. The reliabilities of the 
various scales, determined by the Kuder-Richardson Formula, are mostly 
between .85 and .90. 
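
The text does not say which of the Kuder-Richardson formulas was used. As an illustration only, the sketch below computes the one most often cited, Formula 20, for items scored 1 or 0: r = [k/(k - 1)] [1 - (sum of pq)/(variance of total scores)], where k is the number of items, p is the proportion of persons scoring 1 on an item, and q = 1 - p. The data and the function name `kr20` are invented.

```python
def kr20(item_scores):
    """Kuder-Richardson Formula 20 for items scored 1 or 0.

    item_scores holds one row per person and one column per item.
    """
    n_persons = len(item_scores)
    k = len(item_scores[0])

    # Sum of p*q over items, where p is the proportion scoring 1 on the item.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n_persons
        sum_pq += p * (1 - p)

    # Variance of the total scores (divisor n, matching the p*q terms).
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n_persons
    variance = sum((t - mean_total) ** 2 for t in totals) / n_persons

    return (k / (k - 1)) * (1 - sum_pq / variance)


# Invented data: five persons, four items.
item_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
coefficient = kr20(item_scores)
print(round(coefficient, 2))
```

For these data the sum of pq is .80 and the variance of the totals is 2.0, so the coefficient is (4/3)(1 - .40) = .80.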


While interest inventories are very useful in certain situations, especially 
in guidance and counseling, it is necessary to keep in mind certain facts 
in order to avoid overgeneralizing the results obtained by their use. In 
the first place, the inventories are not aptitude tests. While a similarity- 
of-interest pattern like that of persons successfully engaged in a given 
occupation is undoubtedly a desirable attribute for one considering that 
occupation as his life’s work, such resemblance is not a guarantee of success 
in the occupation. An inventory might reveal a pattern of interests which 
resembles very closely that of successful engineers, for example, but much 
more than an interest pattern is needed to achieve success in such work; the 
same is true of any other profession or occupation. Overenthusiastic or 
uncritical users of these instruments, especially in the schools, should guard 
against advising a student to choose a given occupation solely on the basis 
of scores on an interest inventory. 

In the second place, it should always be kept in mind that interests 
change, particularly below the age of twenty-five years. When an interest 
inventory is given in the junior high school, for example, the results should 
be considered as very tentative and likely to change markedly, perhaps 
more than once, before the individual “settles down.” It is common 
knowledge that many persons change their goals even after leaving college. 
The results of an interest inventory should always be regarded as provi- 
sional, at least until the individual has attained full maturity. 

Finally, it should be recognized that few persons have the breadth of 
knowledge and experience to make valid choices among such a wide range 
of activities as these inventories present, and that many such choices must, 
at best, be based upon false or very limited information. Moreover, when 
the inventories are used with children in junior high schools it seems likely 
that many of the words in the inventories may pose vocabulary problems. 
If a pupil does not know the meanings of the words or terms, he certainly 
cannot make intelligent choices among them. 

In the light of these considerations, the use of interest inventories with 
subjects below the senior high school level seems questionable. When they 
are used, it should always be with full cognizance of the limitations and 
safeguards that have been pointed out. Interest inventories are valuable 
tools and their use can be recommended particularly for counselors and 
guidance workers provided the results are interpreted and used with ap- 
propriate caution. 


Other Interest Inventories 


1. Brainard Occupational Preference Inventory. 1945. Grades 9-12 and 
adults. Commercial, personal service, agricultural, mechanical, professional, 
esthetic, and scientific. 

One form, $.25 per booklet; record forms, $1.60 per 25. 30 minutes. 

Psychological Corporation. 


2. Cleeton Vocational Interest Inventory. 1943. Grades 9-16 and adults. 
Biological sciences, sales, physical sciences, social sciences, business, literary, 
mechanical, finance and accounting, artistic, elementary teacher, high school 
teacher, personal service, household and factory, homemaking. 

One form for men, one for women. $2.00 per 25, plus $.043 per answer 
sheet. 50 minutes. 

McKnight and McKnight. 


3. Guilford-Shneidman-Zimmerman Interest Survey. 1948. Grades 9-16 
and adults. Artistic, linguistic, scientific, mechanical, outdoor, business, social, 
personal, office. 

One form. $4.00 per 25; answer sheets, $.50 per 25. 50 minutes. 

Sheridan Supply Company. 


4. Occupational Interest Inventory. 1944. Intermediate, Grade 7 to average 
adult; Advanced, Grades 9-12, college, superior adult. Personal-social, 
natural, mechanical, business, arts, sciences. 

One form, $.14 per copy. 40 minutes. 

California Test Bureau. 


5. Thurstone Interest Schedule. 1947. High school and college. Physical 
science, biological science, computational, business, executive, persuasive, 
linguistic, humanitarian, artistic, musical. 

One form, $1.65 per 25. 15 minutes. 

Psychological Corporation. 


Learning Exercises 


5. Would you expect the Strong Interest Blank to be more appropriate for persons 
above the age of eighteen, and the Kuder Preference Record to be more suitable for 
high school ages? If so, why? 

6. Does the fact that interests of adolescents are often not stabilized make the 
use of interest inventories inadvisable? What are some ways in which they can be 
used to advantage with high school pupils? 

7. Compared with other types of tests, there are relatively few interest 
inventories. Can you give some reasons for this? 


THE MEASUREMENT OF ATTITUDES 


Attitudes may be considered to be one phase of personality. They are 
closely associated with feelings and emotions, and are a large factor in deter- 
mining our reactions and behavior. An attitude may be thought of as a 
response pattern, or a tendency to think or act in a particular way under a 
given set of circumstances. Thus, a person has established attitudes to- 
wards certain activities, facts, geographical regions, political parties, and 
towards particular individuals such as the principal of his school, his home- 
room teacher, his classmates, etc. When situations arise in which one or 
another of these is involved, he tends to react in each case in a certain way. 
His attitude toward Communists may be strongly antagonistic; toward his 
principal, neutral; and toward the football coach, strongly favorable. In 
the first instance the attitude may be generalized to include all Communists; 
in the second and third the attitude is specific with respect to a single in- 
dividual. In every case there is likely to be some emotional reaction, how- 
ever slight. 

It has already been pointed out that attitudes condition behavior. An 
unfavorable attitude will usually cause a reaction either of avoidance or of 
aggression; a neutral attitude, indifference; and a favorable attitude, a 
seeking behavior. Of course, not all attitudes can be neatly classified as 
favorable, neutral, or unfavorable. Attitudes range by degrees from one 
extreme to the other, and the use of the three terms is merely for convenience. 


The Method of Equal-Appearing Intervals 


The measurement of attitudes is carried out by self-report methods. 
One method is to present to the subject a list of statements expressing 
attitudes varying widely from very favorable to neutral to very unfavor- 
able; the subject is asked to check those with which he agrees. This method, 
known as the method of equal-appearing intervals, was devised some years 
ago by Thurstone and Chave.⁹ A large number of statements of attitude 
toward something, e.g., the Republican Party, are collected. These state- 
ments must vary by fine degrees in the attitudes they express, from ex- 
tremely favorable to extremely unfavorable. A number of competent 
judges sort the statements into eleven piles or groups according to the shade 
or degree of opinion expressed. All statements in one pile are those judged 
to be expressive of the same attitude. Each pile differs from the adjoining 
ones above and below by apparently equal intervals or equal differences in 
attitude. Each judge sorts the statements independently. 
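
The text does not say how the independent sorts are combined into a single scale value for each statement. One common convention, assumed here, is to take the median pile number across judges. The sketch below illustrates only that arithmetic; the judges' placements and the statement names are hypothetical.

```python
from statistics import median

# Hypothetical data: pile numbers (1 = most favorable, 11 = most unfavorable)
# given to each statement by five judges working independently.
JUDGE_SORTS = {
    "statement_a": [2, 3, 3, 4, 3],
    "statement_b": [6, 5, 6, 7, 6],
    "statement_c": [10, 11, 10, 9, 11],
}


def scale_value(piles):
    """Summarize one statement's placements by the median pile (an assumption)."""
    return median(piles)


def spread(piles):
    """Range of placements; a wide range suggests the judges disagreed."""
    return max(piles) - min(piles)


for name, piles in JUDGE_SORTS.items():
    print(name, scale_value(piles), spread(piles))
```

A statement on which the judges' placements are widely scattered would ordinarily be a poor candidate for the final scale, since it evidently means different things to different readers.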

Next, the judges together select from each group the two or three state- 
ments which they regard as most typical of that group and which express 
most nearly the same degree of attitude. When these are assembled there 
are generally twenty-five to thirty statements varying in expressed attitude 
from very favorable through neutral to extremely unfavorable. Each state- 
ment has a scale value according to its position or grouping. Thus, those at 
the most unfavorable end of the scale may each have a scale value of 11, the 
neutral ones 6, and the more favorable ones 5, 4, 3, 2, and 1, in that order. 
The statements are reproduced in random order and the person whose at- 
titude toward the Republican Party is to be measured is asked to check 
those statements with which he agrees. His score or attitude is based on 
the average scale values of those he checks. 

To illustrate this type of scale, a few statements from a scale constructed 
to measure attitudes toward vocational education in secondary schools 
are given below. 


I think that for his own good, every high school student should 
be required to take one shop course. (2.9) 

I think that one course is as good as another. It all depends on 
what you can do and are interested in. (5.4) 

Students in a regular high school should have an opportunity to 
take vocational courses if they want to. (4.2) 

Vocational subjects are taken by many students because they 
require very little homework and outside study. (8.0) 


9 L. L. Thurstone and E. J. Chave, The Measurement of Attitude (Chicago: University 
of Chicago Press, 1929). 

I think that academic courses should all be elective and that only 
vocational courses should be required. (1.0) 

I see no value to anyone in vocational subjects; they are an ab- 
solute waste of time. (10.5) 

I think that vocational subjects are usually spoiled by the prac- 
tice of “dumping” failures from other subjects into them. (7.7) 


These statements represent a sampling from some thirty that constituted 
a scale of attitudes toward vocational subjects in high school. The scale 
value of each statement is given in the parentheses following it. These 
values are generally not shown on the scale that is checked by the person 
taking the test. This scale, one of four devised for use with high school stu- 
dents, was developed by following very closely the procedure of Thurstone. 
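
The scoring rule for a scale of this type can be sketched in a few lines. The scale values are those printed after the sample statements; the short keys that stand for the statements, and the choice of the arithmetic mean as the "average," are assumptions made for illustration.

```python
from statistics import mean

# Scale values copied from the sample statements (lower = more favorable).
# The dictionary keys are hypothetical labels, not part of the scale.
SCALE_VALUES = {
    "required_shop_course": 2.9,
    "one_course_as_good_as_another": 5.4,
    "opportunity_for_vocational_courses": 4.2,
    "little_homework": 8.0,
    "only_vocational_required": 1.0,
    "absolute_waste_of_time": 10.5,
    "dumping_failures": 7.7,
}


def equal_appearing_score(checked):
    """Return the mean scale value of the statements the subject checked."""
    if not checked:
        raise ValueError("no statements checked; no score can be computed")
    return mean(SCALE_VALUES[key] for key in checked)


checked = [
    "required_shop_course",
    "opportunity_for_vocational_courses",
    "only_vocational_required",
]
score = equal_appearing_score(checked)
print(round(score, 2))
```

A subject who checks these three statements receives a score near the favorable end of the scale, since low scale values denote favorable attitudes on this instrument.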


The Likert Method 

The Likert method¹⁰ of measuring attitudes is somewhat less time-consuming 
than that just described. It, too, begins with a considerable 
number of statements of attitude toward something. However, in this 
case they are either decidedly favorable or decidedly unfavorable. Each 
statement has five (or three) possible responses: SA, Strongly Agree; A, 
Agree; U, Undecided; D, Disagree; SD, Strongly Disagree. The person 
taking the test reacts to every statement by marking one of the five possible 
responses. The responses have weights of 5, 4, 3, 2, and 1 for favorable 
statements, and 1, 2, 3, 4, and 5 for unfavorable. The subject's score is the 
sum of the weights of the responses he checks. A high score indicates a 
highly favorable attitude, a low score the opposite. 
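
The weighting scheme just described may be sketched as follows. The weights are those given in the text; the function name, the way statements are flagged as favorable or unfavorable, and the sample responses are assumptions made for illustration.

```python
# Weights as stated in the text: reversed for unfavorable statements.
FAVORABLE_WEIGHTS = {"SA": 5, "A": 4, "U": 3, "D": 2, "SD": 1}
UNFAVORABLE_WEIGHTS = {"SA": 1, "A": 2, "U": 3, "D": 4, "SD": 5}


def likert_score(is_favorable, responses):
    """Sum the weights of the responses, one response per statement.

    is_favorable: list of booleans, True where the statement is favorable.
    responses:    list of response codes ("SA", "A", "U", "D", "SD").
    """
    if len(is_favorable) != len(responses):
        raise ValueError("every statement must have exactly one response")
    total = 0
    for favorable, response in zip(is_favorable, responses):
        weights = FAVORABLE_WEIGHTS if favorable else UNFAVORABLE_WEIGHTS
        total += weights[response]
    return total


# Hypothetical four-statement scale: two favorable, two unfavorable items.
total = likert_score([True, False, True, False], ["SA", "SD", "A", "U"])
print(total)
```

On a four-statement scale the possible scores run from 4 to 20. Strong disagreement with an unfavorable statement counts as heavily toward a favorable score as strong agreement with a favorable one.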

The Likert method eliminates the sorting by judges and therefore it 
requires less time to prepare a scale than the method of equal-appearing 
intervals. It also uses more statements, as a rule, and the subject is required 
to check all of them, both of which factors tend to increase the reliability 
of scores. 

A sample set of statements set up by the Likert method might read as 
follows: 


SA A U D SD All Mexicans are dirty. 

SA A U D SD Mexicans are intelligent, industrious 
people who have not had opportunity 
to develop. 


SA A U D SD Mexicans should not be permitted 
to enter the United States. 


10 Rensis Likert, “A Technique for the Measurement of Attitudes,” Archives of 
Psychology, Volume 22, No. 140 (1932). 

Other methods used for measuring attitudes require the subject to react 
to pictures, and still another method requires a series of choices from paired 
alternatives such as salesman and mechanic, banker and professor, owner 
and operator, etc. However, the two methods that have been described 
above are probably the best of those developed to the present time. Both 
give fairly high reliabilities for the type of measurement. Correlations between 
scores on comparable scales of the two types are reported to be quite 
high.¹¹ 

One of the chief problems in connection with attitude scales is their 
validity. As with all self-report instruments, the value of the score is de- 
pendent upon the cooperation of the person taking the test. It is very easy 
for him to simulate an extreme attitude if he wishes to do so, simply by 
checking all the strong statements one way or the other, or by strongly 
agreeing or strongly disagreeing with all the statements of one type or an- 
other. Generally this is much easier to do in an attitudes scale than in a 
personality inventory. In the latter the implications of specific statements 
are not always obvious. Unless the subject is honestly trying to cooperate 
when he checks the attitudes scale the results are of little or no value. 

In the second place, what a person agrees or disagrees with on paper is 
not necessarily a reflection of how he really feels or acts. There is no way 
of determining whether or not the subject is honestly expressing what he 
believes. Furthermore, what he endorses on the test is one thing, but his 
actual behavior in the same or a similar situation may not be consistent 
with his verbal responses. Some research has been done on this with 
widely varying findings. Some studies report substantial correlations be- 
tween scores on an attitudes scale and observed behavior; others report 
negligible correlations. Corey, for example, found practically no correlation 
in a college class between scores on a scale of attitude toward cheating and 
actual behavior in an examination.¹² 

Much of the research suggests that there is a positive correlation in the 
neighborhood of .50 to .60 between scores on attitude scales and actual performance 
or behavior. This is not a strong relationship, but it 
does indicate a substantial tendency. The ultimate validity of attitude 
scores depends on how well they correlate with action. It may be interesting 
and in some instances useful to know what an individual's verbalized 
attitudes are, but unless they can be used to predict how he will act such 
data are of limited value for practical purposes. In this respect there is 
still much to be accomplished in the area of attitude scales. 
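
A standard way to gauge what correlations of this size mean is to square them: the square of r gives the proportion of variance in one measure that is linearly associated with the other. The sketch below is an illustration added here, not part of the studies cited, and the function name is hypothetical.

```python
def shared_variance(r):
    """Proportion of variance linearly shared by two measures correlated at r."""
    return r * r


for r in (0.50, 0.60):
    print(f"r = {r:.2f}  r squared = {shared_variance(r):.2f}")
```

On this reckoning, correlations of .50 to .60 correspond to about a quarter to a little over a third of the variation in behavior, which is consistent with calling the relationship substantial but far from complete.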


11 Allen L. Edwards and Kathryn C. Kenney, “A Comparison of the Thurstone and 
Likert Techniques of Attitude Scale Construction,” Journal of Applied Psychology, 
30:72-83 (February, 1946). 

12 Stephen M. Corey, “Professed Attitudes and Actual Behavior,” Journal of Educational 
Psychology, 28:271-80 (April, 1937). 


8. List some subjects such as school, scholastic athletics, high marks, or senior 
trips, toward which a high school faculty might wish to test attitudes. Name 
some that elementary teachers might be similarly interested in. 

9. Select one of the above and write out ten statements expressing different 
attitudes toward it that might become part of a scale, following either the Thurstone 
or the Likert plan. 

10. What are some of the issues for which scales of attitudes have been devised? 
Find in the literature on educational measurement a report of one such project and 
prepare an abstract of it. 

11. Can attitudes be changed? If so, how? Of what value are tests or scales 
of attitude in such attempts? 


The Measurement of Personality and Adjustment: Observational Techniques 


In Chapter 11 various self-report approaches to the measurement of 
personality and adjustment were presented. In this chapter we shall con- 
tinue and conclude the discussion of the measurement of personality and ad- 
justment with a consideration of observational techniques. Observational 
techniques employ information supplied by sources other than the indi- 
vidual being studied. The techniques which will be discussed here are rat- 
ing scales, systematic observation, anecdotal records, and sociometric 
methods. Each will be described and illustrated. There will also be a 
brief discussion of the effective use and the particular advantages and dis- 
advantages of each technique. 


RATING SCALES 


The basic purpose of rating devices is to obtain systematically and 
objectively a sampling of opinion on certain characteristics of a given 
individual. Such judgments should be obtained from people who are well 
acquainted with the person being rated, and who can express accurate and 
dependable opinions. In order to obtain satisfactory results it is essential 
to follow certain well-established and tested procedures. Among other 
things, it is necessary to define the traits or characteristics on which the 
ratings will be based, to provide some kind of scale or range by which the 
rater can indicate his judgment of the amount or degree of the trait, and 
to give the rater some specific and carefully worked out instructions regard- 
ing the purposes and use of the instrument. In addition, it is highly desir- 
able to meet with the persons who are going to do the rating in order to 
discuss the use of the device with them, and, if possible, to give them some 
practice in using it. Instruction and information in addition to that which 
is printed on the scale must be given to persons untrained in the use of 
rating scales if the results are to be valid and reliable. 

In constructing a rating scale the first step is to divide the broad area to 
be analyzed into specific traits or characteristics. To ask for ratings in such 
wide areas as “personality” or “adjustment” would give results that are 
practically meaningless. Therefore, the concept of “personality” should 
be broken down into more specific and definable terms such as “persistence,” 
“cheerfulness,” “aggressive tendencies,” “generosity,” or “resourcefulness.” 

When such an analysis has been made, the next step is to define each of 
these traits in terms which will be meaningful to the rater and which will 
convey similar meanings to different persons using the scale. This is a 
difficult task and, of course, it is never possible to be sure that one has 
succeeded. However, if the traits are clearly defined in terms of behavior, 
rather than vague abstractions, it helps materially to insure that those 
using the scale will have common understandings of the traits being rated. 

Finally, the specific traits should be defined in such a way that the 
definitions provide descriptions of the varying degrees of each trait and lead 
the rater to make quantitative judgments rather than vague, meaningless 
generalizations. 


BEC Personality Rating Schedule 

An excerpt from a graphic rating scale of personality is shown in Figure 9. 
It clearly illustrates the principles of construction and organization that 
have just been discussed. 

This rating scale, or rating schedule, as it is called, provides opportunity 
for ratings on twenty-nine traits grouped in eight categories: Mental 
Alertness, Initiative, Dependability, Cooperativeness, Judgment, Personal 
Impression, Courtesy, and Health. A score may be obtained on each trait 
and these scores may be averaged within each category or for all twenty-nine 
traits, if desired. 


Fig. 9 

Excerpt from BEC Personality Rating Schedule 

V. JUDGMENT 

For each trait the descriptions are listed from the highest level (marked 5 
on the schedule) to the lowest. 

1. Sense of Values 
   Is unfailingly keen of insight in distinguishing the important from the 
      unimportant in classwork 
   Generally distinguishes the important from the unimportant in classwork 
      even when confusion might be easy 
   Distinguishes satisfactorily between the important and the unimportant 
      in classwork 
   Occasionally confuses the important with the unimportant in classwork 
   Commonly neglects crucial issues in classwork through attention to the 
      unimportant 

2. Deliberativeness 
   Always considers carefully all aspects of problem situation before 
      proposing solution 
   Usually considers all important aspects of problem situation before 
      proposing solution 
   Seldom proposes solution to important problem situation without some 
      preliminary analysis 
   Sometimes proposes solutions to problem situations without any 
      preliminary analysis 
   Is constantly jumping at conclusions 

3. [trait name missing] 
   Extremely gifted in discerning the best thing to do or say when dealing 
      with others; never gives any offense 
   Usually says or does the suitable thing when dealing with others 
   Only rarely gives any offense through ill-considered speech or action 
   Sometimes says or does the wrong thing when dealing with others 
   Frequently gives offense through lack of discernment in speech or action 

4. Worth of Opinions 
   His opinion invariably sought by colleagues in deliberative assemblies 
   His opinion usually valued by colleagues in deliberative assemblies 
   His views generally accorded a courteous reception 
   His opinion not generally sought by colleagues 
   His opinions accorded little esteem in deliberative meetings 

(Reprinted by permission of the publishers from Philip J. Rulon and others, 
BEC Personality Rating Schedule; Cambridge, Mass.: Harvard University Press. 
Copyright, 1936, by The President and Fellows of Harvard College.) 


Michigan Department of Mental Health Rating Scale for Pupil Adjustment 


Another rating scale, set up somewhat differently, is the Michigan Department 
of Mental Health Rating Scale for Pupil Adjustment.¹ This provides 
for ratings on eleven characteristics or traits: Over-all Emotional 
Adjustment, Social Maturity, Tendency Toward Depression, Tendency 
Toward Aggressive Behavior, Extroversion-Introversion, Emotional Security, 
Motor Control and Stability, Impulsiveness, Emotional Irritability, 
School Achievement, and School Conduct. In addition, there is an opportunity 
to check physical conditions of the child such as height, weight, 
and physical handicaps and defects. The sample below shows how this 
scale is set up. 

1 Michigan Department of Mental Health Rating Scale for Pupil Adjustment (Chicago: 
Science Research Associates, 1953). Quotation from this scale by permission of the 
publisher. 


IX. Emotional Irritability 
(Definition: Tendency to become angry, irritated, or upset.) 
A. Usually good-natured. 
B. Good-natured — rarely irritable. 
C. Fairly good-natured — occasionally irritable. 
D. Moderately irritable — frequently shows moderate irritation. 
E. Extremely irritable — frequently shows marked irritation. 


A    B    C    D    E 


Obviously, this is not a graphic rating scale in the usual sense of the term. 
There is no line or continuum upon which a check mark can be placed 
anywhere according to the rater’s best judgment. Instead, there are five 
levels for each trait, and the rater simply checks the one of these which is 
most appropriate. He cannot signify an in-between rating. In the opinion 
of some persons this is a disadvantage in that it does not permit as much 
differentiation as the graphic scale. 

A score is obtained on each trait by multiplying A ratings by 5, B’s by 4, 
C's by 3, D's by 2, and E’s by 1. These products are added to get a total 
score. The authors state that the best index of Total Emotional Adjust- 
ment is a score based on a combination of four traits, namely, Over-all 
Emotional Adjustment, Social Maturity, Emotional Security, and Impul- 
siveness. It is recommended that this score be used as the adjustment 
criterion until results of further research on validity are available. Ratings 
on various combinations of traits may be used to obtain scores in other 
areas such as Aggressive Behavior or Inhibitory Control. 
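
The scoring just described can be sketched briefly. The letter weights and the four traits entering Total Emotional Adjustment are taken from the text; treating the "combination" of the four traits as a simple sum is an assumption, and the ratings shown for the pupil are hypothetical.

```python
# Letter weights as given in the text.
WEIGHTS = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}

# The four traits the authors combine for Total Emotional Adjustment.
TOTAL_ADJUSTMENT_TRAITS = (
    "Over-all Emotional Adjustment",
    "Social Maturity",
    "Emotional Security",
    "Impulsiveness",
)


def total_emotional_adjustment(ratings):
    """Sum the weighted ratings on the four traits (simple sum assumed)."""
    return sum(WEIGHTS[ratings[trait]] for trait in TOTAL_ADJUSTMENT_TRAITS)


# Hypothetical ratings for one pupil; other traits may be present but unused.
pupil = {
    "Over-all Emotional Adjustment": "B",
    "Social Maturity": "C",
    "Emotional Security": "A",
    "Impulsiveness": "D",
    "Emotional Irritability": "C",
}
adjustment = total_emotional_adjustment(pupil)
print(adjustment)
```

Under this assumption the score can range from 4 (all E ratings) to 20 (all A ratings).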

The rating scale is suggested for use as a screening device. After pupils 
have been rated by their teachers and the scores calculated, a distribution 
of scores is made. Pupils scoring in the lower third or lower fourth of the 
distribution may be referred to the proper clinical services for further 
study and therapy. The proportion suggested for referral will depend on a 
number of factors such as the quality and availability of clinical services, 
and the character of the school population with respect to such factors as 
culture, socio-economic status, and geographical area. 
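
A minimal sketch of the screening step follows, assuming that a lower score indicates poorer adjustment, as the weights above imply. The pupil labels and scores are hypothetical, and the handling of tied scores at the cutoff is left out for simplicity.

```python
def pupils_to_refer(scores, fraction=0.25):
    """Return the pupils in the lowest `fraction` of the score distribution.

    scores:   mapping of pupil label to adjustment score.
    fraction: share of the group to refer, e.g. 0.25 for the lower fourth.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1])
    count = int(len(ranked) * fraction)  # whole pupils only, rounded down
    return [name for name, _ in ranked[:count]]


# Hypothetical class of eight pupils.
class_scores = {
    "P1": 14, "P2": 9, "P3": 18, "P4": 11,
    "P5": 16, "P6": 7, "P7": 13, "P8": 20,
}
referred = pupils_to_refer(class_scores, fraction=0.25)
print(referred)
```

The fraction is left as a parameter because, as noted, the proportion referred depends on local conditions such as the availability of clinical services.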


Rating scales and devices may be used for purposes other than rating 
personality and adjustment. For example, they may be used to rate per- 
formance on a job, the quality of a product such as a cake or a lampstand 
made by a pupil, or the quality of handwriting. The use of rating scales 
for such purposes has already been discussed in a previous chapter. 
In closing the discussion of rating scales it may be helpful to present a 
few suggestions regarding their effective use. We mentioned earlier that it 
is not safe to assume that anyone can use a rating scale properly and ef- 
fectively without instructions; indeed, it is generally recognized that some 
instruction is necessary if the results are to be of value. 

Some of the common errors in using rating scales may serve as a starting 
point in developing suggestions for effective use. A frequent cause of error 
is the halo effect. This refers to the tendency of the rater to let his general, 
over-all impression of the person being rated influence his ratings on every 
trait. If he likes the subject, he will tend to rate favorably on everything; 
if he is not favorably disposed toward him, that likewise tends to color all 
his ratings. 

Another common error is the tendency to avoid the ends of the scale, that 
is, to avoid rating persons very high or very low. This is sometimes re- 
ferred to as the error of central tendency and is likely to occur where raters 
are not well acquainted with the persons being rated. A similar type of 
tendency is known as the generosity error, which refers to the practice of 
rating everyone average or above. When this happens no one gets a rating 
below the middle of the scale, a fact which is manifestly unrealistic in most 
situations since there are usually as many below the average in a given group 
as there are above. 

Another common error is called the stereotype error. This means that 
some raters will have preconceived ideas regarding members of certain 
groups — racial, religious, economic, or occupational — and will tend to 
rate them accordingly. 

There are other types of errors in using rating scales, but the foregoing 
are among the most common and serious. Below are a few suggestions 
which should help to counteract, if not entirely overcome, such error 
tendencies. 


Suggestions for Users of Rating Scales 


1. Rate each member of a group in comparison with all the others in his 
group. If only one person is being rated compare him mentally with others 
of his same level, class, occupation, etc. Do not rate him on the basis of 
some ideal that exists only in your imagination, or on the basis of some un- 
realistic and unattainable standards. 


2. Rate each person on one trait before going to the next. For example, 
if there are 35 pupils to be rated on 10 traits, rate all 35 on Trait 1, then all 
35 on Trait 2, and so on. This is believed to make ratings more accurate 
and dependable in that the rater concentrates on one trait at a time and 
compares each member of the group with all the others on the same trait 
before considering another trait. 


3. Wherever possible use multiple ratings. That is, have several teachers 
or observers rate the same pupils without consulting each other. Ratings 
which are made independently by several raters and then considered col- 
lectively are much better than single ratings by one individual. 


4. In making ratings try to think of the individual’s behavior in as many 
different situations as possible. Isolated incidents, while perhaps very 
striking, are not always typical of his usual behavior. 


5. Do not rate individuals on traits or categories for which you cannot 
cite specific evidence or behavior to support your rating. If you have no 
basis for making a judgment, do not rate. Rather, leave that item unmarked 
and note “NOTO” (No Opportunity To Observe). A false or inaccurate 
rating is worse than none at all. 


6. If you are responsible for obtaining ratings, give those who are to 
make them some instruction and assistance. A staff meeting or two could 
well be devoted to the development and discussion of such points as the five 
preceding. 
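
Suggestions 3 and 5 can be combined when independent ratings are pooled: items marked NOTO are set aside rather than counted as low ratings. The sketch below assumes a numerical scale, uses the arithmetic mean as the pooled value, and represents NOTO by `None`; all three choices, and the sample ratings, are illustrative assumptions.

```python
from statistics import mean

NOTO = None  # stands for "No Opportunity To Observe"


def pooled_rating(ratings):
    """Average the independent ratings on one trait, ignoring NOTO entries.

    Returns NOTO when no rater had a basis for judgment.
    """
    observed = [r for r in ratings if r is not NOTO]
    return mean(observed) if observed else NOTO


# Hypothetical ratings of one pupil on one trait by four teachers.
print(pooled_rating([4, 5, NOTO, 3]))
print(pooled_rating([NOTO, NOTO]))
```

Leaving an unobserved trait out of the average keeps a rater's lack of information from pulling the pooled rating up or down.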


Learning Exercises 


1. Devise a short graphic rating scale for the five traits of Industry, Perseverance, 
Courtesy, Emotional Stability, and Sociability. Try to define degrees of each trait 
in terms of observable behavior. 

2. Write out a set of instructions which will help fifth-grade teachers use the 
scale correctly. 


Other Rating Scales 

1. Haggerty-Olson-Wickman Behavior Rating Scale. 1930. Grades K-12. 
Intellectual, physical, social, and emotional traits. 

One form, $2.30 per 35. 

World Book Company. 

2. KD Proneness Scale. 1950. Grades 7-12. Delinquency proneness 
(truancy record, home background, attitude toward school, club membership, 
family mobility, etc.). 

One form, $3.45 per 35. 25 minutes. 

World Book Company. 


3. New York Rating Scale for School Habits. 1927. Elementary and high 
school. Ratings on nine traits descriptive of school habits. 

One form, $1.25 per 35. 25 minutes. 

World Book Company. 

4. Pupil Adjustment Inventory. 1951. Grades K-12. Consists of a rating 
scale for each of a number of academic, social, emotional, physical, interest, 
school, and family-background characteristics as they are related to the adjust- 
ment of the pupil. 

Short and long forms. $2.70 per package of 35 short forms, 5 long 
forms. 
Houghton Mifflin Company. 


5. Vineland Social Maturity Scale. 1946. Birthday to maturity. Self-help, 
self-direction, occupation, locomotion, communication, and socialization. 

One form, $1.65 per 25. 30 minutes. 

Educational Test Bureau. 


OBSERVATION 


Although each of the methods described in this chapter involves observa- 
tion, there is an observation technique which has several features that merit 
individual consideration. The observation technique has been developed 
primarily in connection with child study. Nursery schools and kindergartens, 
particularly where these are part of a laboratory or demonstration 
school, are commonly equipped with one-way-vision screens so that children 
may be observed without their seeing the observers or knowing that they 
are being observed. Some efficient and dependable procedures for making 
and recording such observations have been developed through experience 
and research, and these procedures will be considered briefly as a means of 
evaluating behavior, personality, and adjustment. 

As in the case of rating scales, one basic principle in observation is to de- 
fine the behavior to be observed. It has not been found very useful or satis- 
factory just to “observe” children. It is much more productive first to 
define what is going to be observed, and then to concentrate on observation 
in terms of the definition established. For example, suppose one were in- 
terested in making a study of personality traits in a group of four- and five- 
year-olds. The first step would be to identify and define the traits to be 
observed — such traits, for example, as cooperative behavior. What con- 
stitutes cooperative behavior at this age? Probably a dozen or more kinds 
of behavior (sharing toys, helping the teacher, picking up, etc.) could be 
thought of that would be evidence of cooperation among five-year-olds. 

When the particular characteristic in question has been analyzed and 
divided into specific acts or behavior patterns, these elements are listed on a 
schedule or check list which the observer uses as a means of recording ob- 
servations. Each time a particular behavior is observed it is recorded on 
the check list. In addition, cooperative behaviors not listed can be added 
as they occur. 
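
A check list of this kind amounts to a running tally. In the sketch below the three listed behaviors come from the example in the text; the recorded observations and the unlisted behavior "taking turns" are hypothetical.

```python
from collections import Counter

# Behaviors defined in advance as evidence of cooperation (from the text).
CHECK_LIST = ["sharing toys", "helping the teacher", "picking up"]


def tally(observed, check_list=CHECK_LIST):
    """Count each observed behavior; listed behaviors start at zero.

    Behaviors not on the check list are added as they occur.
    """
    counts = Counter({behavior: 0 for behavior in check_list})
    counts.update(observed)
    return counts


# Hypothetical record from one observation period.
record = ["sharing toys", "picking up", "sharing toys", "taking turns"]
counts = tally(record)
for behavior, n in counts.items():
    print(behavior, n)
```

Starting every listed behavior at zero preserves the fact that a behavior was looked for and not seen, which is different from its not having been on the schedule at all.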

A second principle is that there should be frequent and distributed observations. 
This means that it is better to divide the total observation 
time per child into smaller amounts for frequent observation than to use it 
all in one or two observations. Assuming, for example, that the observer 
has two hours of observation time per child, what is the best way to use it? 
It can be used in 1 two-hour block, 2 one-hour periods, and so on, down to 
120 one-minute observations, or even shorter and more frequent ones. 
While no arbitrary, hard-and-fast rule can be given, it is generally agreed 
that a total of two hours divided into 24 five-minute periods would be 
preferable to longer and less frequent observations. Some investigators 
would use shorter and more frequent periods than this. In general, expert 
opinion seems to favor frequent, short observations distributed over a 
period of several weeks and falling at different times of the day. The chief 
advantage of such a plan is that it is likely to yield a more adequate sample 
of a child’s behavior and thus reduce the chances of getting erroneous im- 
pressions from a long observation on what might be a very non-typical day. 
Rotating the time of observation so that the same child is not observed 
at the same time of day each observation period reduces the probabilities 
of getting consistently biased samples of behavior at a particular time of 
day, such as just before lunch. 
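
The arithmetic of dividing the observation time, and one simple way of rotating the time of day, can be sketched as follows. The rotation rule, the children's names, and the clock times are hypothetical; the text recommends rotation but prescribes no particular scheme.

```python
def number_of_periods(total_minutes, period_minutes):
    """How many observation periods of a given length fit in the total time."""
    return total_minutes // period_minutes


def rotating_schedule(children, slots, sessions):
    """Shift each child's time slot by one position every session."""
    schedule = []
    for session in range(sessions):
        assignment = {
            child: slots[(index + session) % len(slots)]
            for index, child in enumerate(children)
        }
        schedule.append(assignment)
    return schedule


print(number_of_periods(120, 5))

plan = rotating_schedule(["Ann", "Ben", "Cal"], ["9:00", "10:30", "11:45"], 3)
for session in plan:
    print(session)
```

With three children and three slots, each child is observed once at each time of day over three sessions, so no child is seen only just before lunch.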

It must be recognized, of course, that longer observation periods may 
be preferable under certain conditions or for certain purposes. This is 
particularly true where sequence of behavior is to be studied and where the 
development from beginning to end of certain behavior situations is to be 
observed. 

Instead of defining and concentrating the observation on a specific be- 
havior, one may keep a continuing or running account of the total behavior 
of a given child over a period of several days. This procedure has the ad- 
vantage of giving a more complete picture of the child, though it usually 
lacks the objectivity of the other method and it does not yield data which 
can readily be expressed in quantitative terms, such as a count of the 
number of times the defined behavior occurred. 

Much depends on the training and the skill of the observer. He must 
be able to observe and record objectively, keep personal bias out of his 
observations, and distinguish clearly between observation and interpreta- 
tion. He should record only what happens and do so as promptly as he 
can so that he does not have to rely too long on memory for important data. 
It is generally best to concentrate on obtaining a complete and accurate 
record at the time of observation and to make interpretations of the record 
later when there is more time for careful study. 

An illustration or two of the observation method should serve to make it 
more definite and meaningful. One of the earliest applications of the time- 
sampling method of observing behavior in young children was a study 
reported by Olson in 1929.² He observed nervous habits in elementary 


2 Willard C. Olson, The Measurement of Nervous Habits in Normal Children, Institute 
of Child Welfare Monographs, No. 3 (Minneapolis, Minn.: University of Minnesota 
Press, 1929). 
school children by using five-minute observation periods for each child and 
recording the incidence of nervous habits (such as nose-picking, twitching, 
etc.). Although later investigators have improved upon and refined his 
procedure, Olson’s was one of the first to yield quantitative data based on 
systematic time samples of children’s observable behavior. 

One more illustration of the observation technique will show how it can 
be used for evaluative purposes in the classroom. An experiment was con- 
ducted in seventy elementary schools in New York City to compare the 
effects of “activity programs” with a more conventional type of program. 
The methods used to evaluate the results of the experiment included a wide 
variety of tests, anecdotal records, and observations. The observers, 
after a period of orientation and training, carried out a series of half-hour 
observations on each class in the experimental and the control schools. 
The observer recorded by use of a code each occurrence of pupil activity 
which the experiment was designed to encourage. The number of observa- 
tions varied from six to fourteen per class. When observations were checked 
by having two observers present at the same time, a substantial amount of 
agreement (averaging above 85 per cent) was found. The results of the ob- 
servations showed a distinct superiority for the activity-program schools 
in the number and variety of pupil activities. The control classes showed 
a reliable superiority in recitational behavior. 
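
A check on agreement between two observers can be computed very simply. The sketch below assumes that each observer's record is a list of codes, one for each observation interval, and that agreement means identical codes in the same interval. The report cited does not state how its percentage was computed, so this is one plausible method rather than the investigators' own, and the codes shown are invented.

```python
def percent_agreement(record_a, record_b):
    """Percentage of observation intervals coded identically by two observers."""
    if len(record_a) != len(record_b):
        raise ValueError("records must cover the same intervals")
    if not record_a:
        raise ValueError("records must not be empty")
    matches = sum(1 for a, b in zip(record_a, record_b) if a == b)
    return 100.0 * matches / len(record_a)


# Hypothetical activity codes, one letter per interval, for two observers
# who watched the same class at the same time.
observer_1 = list("PPDQPDDQPP")
observer_2 = list("PPDQPDQQPP")

print(percent_agreement(observer_1, observer_2))  # 9 of 10 intervals match: 90.0
```

Simple percentage agreement takes no account of matches that would occur by chance, so it is best read as a rough index of consistency.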

It should be mentioned before closing this discussion of the observational 
method that the importance of training and skill in observation can scarcely 
be overemphasized. People vary greatly in their ability to observe and 
report accurately on what they have seen. It is well known, for example, 
that witnesses to an accident may give diametrically opposite accounts of 
what took place. Even under less strained conditions observers in a lab- 
oratory or in a theater may differ in the accuracy of their observations, even 
though they have witnessed the same circumstances or events. 

In research the observers are usually carefully selected on the basis of 
tests, and they are thoroughly trained. Furthermore, their reports are 
checked against those of other observers for agreement and consistency. 
Data gathered under such conditions are likely to be acceptable in validity 
and reliability. However, there are many situations in which such pre- 
cautions and safeguards are impractical, even though systematic observa- 
tions are desirable. Observations must often be made by classroom teach- 
ers who are relatively untrained for this work. Not only where educational 
experiments are being conducted in the schools, but in the daily activities in 
the classroom, much of our information about pupils and activities is based 


3 A. T. Jersild, R. L. Thorndike, B. Goldman, and J. J. Loftus, “An Evaluation of 
Aspects of the Activity Program in the New York City Public Elementary Schools,” 
Journal of Experimental Education, 8:166-207 (December, 1939). 
on observation by teachers. Observation by classroom teachers is generally 
informal and unsystematic and is carried on without benefit of check lists 
or planned procedure. Anything that can be done to make teachers more 
reliable and accurate observers will inevitably add to our understanding of 
children and will help them to make better adjustments and develop more 
wholesome personalities. To become better observers teachers must first 
of all have a genuine interest in improvement; they must be willing to ac- 
cept instruction and assistance, and they must be willing to have their ob- 
servations checked for accuracy and dependability. Students of science 
are trained constantly and rigorously in careful observation. It seems quite 
as important that prospective teachers be given similar training in observa- 
tion, since much of what is known about children is based upon observation 
by their teachers. 


Learning Exercises 


3. Devise a record sheet in the form of a check list that might be used in observing 
and recording evidence of aggressive behavior in kindergarten children. 

4. Devise a similar form for recording observations of changes in behavior as a 
result of instruction in a unit on personal hygiene in ninth-grade general science. 


ANECDOTAL RECORDS 


The method discussed in the preceding section is a systematic procedure 
for gathering observational data, and is used more often in research than 
in everyday classroom situations. A method used more frequently and 
informally is the anecdotal record. This is the teacher’s written record of 
an occurrence or incident involving a child. For example, the following 
might be typical: 


Grade 5 - Miss Jones 


9/15/57 A new boy, Jimmy Long, came to school this morn- 
ing. He is large and strong for his age. I heard him 
telling some of the other boys during recess about his 
dad who is a professional baseball player. (He didn't 
seem to be boasting — just proud of his dad.) 

9/18/57 Jimmy, the new boy, got into an argument with Bob 
about a percentage problem dealing with baseball 
batting averages. Later, Jimmy hit John with a ruler 
and John cried. I made Jimmy say he was sorry. 
(Jimmy may be inclined to bully. Time will tell.) 
These are samples of what Miss Jones might record as significant anecdotes 
concerning a new pupil. They illustrate some of the generally accepted 
principles of making anecdotal records. 

First, it should be noted that the record is in two parts: the incident itself 
and, in parentheses, Miss Jones's interpretation. This is desirable 
and important if the records are to be maximally useful. When fact and 
interpretation are mingled it makes the records less objective and more 
difficult to interpret. Also, different persons may interpret an anecdote 
differently. 
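
For a teacher who keeps such records in a file on a machine rather than on paper, the same separation of fact and interpretation can be built into the form of the record. The following is a minimal sketch only; the class name Anecdote and its field names are assumptions made for illustration, not a standard form.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Anecdote:
    """One anecdotal record, with fact and interpretation held apart."""

    pupil: str
    observed_on: date
    incident: str                         # what actually happened
    interpretation: Optional[str] = None  # the teacher's comment, if any

    def card(self) -> str:
        """Format the record with the interpretation in parentheses."""
        text = f"{self.observed_on:%m/%d/%y} {self.pupil}: {self.incident}"
        if self.interpretation:
            text += f" ({self.interpretation})"
        return text


entry = Anecdote(
    pupil="Jimmy Long",
    observed_on=date(1957, 9, 18),
    incident="Hit John with a ruler and John cried.",
    interpretation="May be inclined to bully. Time will tell.",
)
print(entry.card())
```

Because the interpretation is a separate field, a later reader can study the incidents alone and form an independent judgment.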

Second, the examples suggest that Miss Jones will keep a continuing 
record on Jimmy over a period of time and in a variety of situations. By 
this means she will secure a quite complete and accurate picture of Jimmy’s 
personality and will certainly be in a good position to give him help if he 
needs it. 

Third, the samples indicate what Miss Jones considers to be significant 
aspects of Jimmy’s behavior. There are undoubtedly many other occur- 
rences which she might have recorded, but these seemed to her to give the 
most insight into his personality during the first few days of school. 

It is not easy to know what is significant and worth recording. Teachers 
inevitably have preferences and dislikes among their pupils, often without 
being fully aware of them, and these biases tend to influence the choice of 
children about whom anecdotes are recorded, as well as the nature of the 
anecdotes. It is quite natural, also, to overlook the shy, quiet child and 
to record anecdotes only on the more aggressive children. The observation 
that “Ruth sat in her seat and looked out the window while the other children 
came up to the desk to see the turtle,” may be just as significant as the 
fact that “David brought a live turtle to school today which attracted a 
great deal of attention to him as well as to the turtle.” 

A recurring question on the matter of anecdotal records concerns the num- 
ber of children on whom to keep such records and the number of anecdotes 
to record. Some authorities advocate that anecdotes should be regularly 
recorded for all children. Ideally this is certainly desirable, but in most 
situations it would be impractical for busy teachers. If the goal were even 
one anecdote per week per pupil this would mean something like 35 or 40 per 
year per pupil. A teacher with 35 pupils in her room would have 35 × 35, 
or 1,225 anecdotes per year to record, and this would be no small task in 
itself, to say nothing of the time required for the interpretation and use of 
the anecdotes. In this connection it should be noted that keeping anec- 
dotal records presents quite different problems for elementary and high 


4 Arthur E. Traxler, The Nature and Use of Anecdotal Records, Educational Records 
Supplementary Bulletin D, Revised (New York: Educational Records Bureau, 
school teachers. The former has about 35 pupils in her room all day every 
day, and this provides ample opportunity to observe significant happenings. 
The high school teacher, on the other hand, may see 150 pupils every day 
for only one period each. Observing and recording significant anecdotes 
in this case is obviously more difficult. 
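
The size of the recording task is easily estimated. The short calculation below takes the figures used in this discussion (35 pupils, one anecdote per pupil per week, 35 to 40 school weeks) and, as a further assumption not made in the text, applies the same rate to a high school teacher who meets 150 pupils.

```python
def anecdotes_per_year(pupils, weeks_per_year, per_pupil_per_week=1):
    """Total anecdotes a teacher would write in one school year."""
    return pupils * weeks_per_year * per_pupil_per_week


# Elementary teacher with 35 pupils, for a 35-week and a 40-week year.
for weeks in (35, 40):
    print(weeks, "weeks:", anecdotes_per_year(35, weeks))  # 1225, then 1400

# Assumed case: a high school teacher with 150 pupils and a 35-week year.
print(anecdotes_per_year(150, 35))  # 5250
```

Even at the lower figure the elementary teacher would be writing about seven anecdotes every school day, which shows why a modest beginning is usually advisable.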

In most situations it is probably best to begin by keeping records on a 
few pupils. As the teacher gains some experience and confidence he can 
undertake the recording of anecdotes on additional pupils. Even with a 
modest beginning, however, the teacher should take care to avoid the common 
mistake of keeping records on “problem cases” only. As we have already 
suggested, the shy, reticent child may be just as much in need of 
study and help as the one who is always causing trouble. In the beginning, 
when records are kept on only a few children, it would be well to select the 
children with a view to including some of the less obvious cases as well as 
some who demand attention. 

Records of anecdotes should be made as soon as possible after the incident 
has been observed, but never so that pupils are aware that this is being 
done. Many teachers find it best to make a few notes at the first opportu- 
nity and then to make a complete record during free time at noon or after 
school. Anecdotes should always be recorded on the same day if this is at 
all possible, for the longer the time elapsed between the occurrence and the 
recording of it, the less distinct and accurate one’s memory of the incident 
becomes. Anecdotes are probably best recorded on cards. Each anecdote 
may be recorded on a single small card, or, as is sometimes preferred, several 
anecdotes may be recorded on one large card. The latter system makes the 
interpretation of trends or developments a little easier for some teachers. 
In any case, cards are the most convenient means of record-keeping since 
they are easy to file, sort, handle, and arrange. 

To be most useful, anecdotal records on a pupil should be kept over an 
extended period of time. To obtain a reliable sample of a child’s behavior 
and to make any useful assessment of changes that may occur, it is essential 
that an adequate number of anecdotes or observations be made. These 
principles apply here no less than in the case of systematic observations 
discussed in the preceding section. It is of little value to record an anecdote 
or two about an individual and then neglect him for a month. By observing 
him long enough to see how he functions in a variety of situations from day 
to day it is possible to gain much better insight into his personality and 
whatever difficulties he may have. The only exception to this principle 
might be in a school system where anecdotal records are a part of the regular 
cumulative records kept on all pupils. In such a system it would probably 
be impossible to record anecdotes frequently and regularly on every child; 
occasional anecdotes would have to suffice. Nevertheless, the principle 
still holds that the more frequent and regular the anecdotes recorded for a 
given individual, other things being equal, the better we can understand 
and help him. 

If anecdotal records are to be valuable they must be used, and if they are 
to be used they must first be interpreted. To interpret the records it is of 
course necessary to study and summarize them. Several anecdotes on a 
single pupil must be studied and compared. They tell a story, reveal char- 
acteristic behavior, show the individual in his interaction with others in a 
natural setting. In these respects anecdotal records have certain advan- 
tages over other methods such as ratings or systematic observation. How- 
ever, the task of summarizing and interpreting is not an easy one, and it in- 
variably takes much thought and time. Ordinarily, summarizing should be 
done often enough to keep abreast of typical behavior and developments 
of the individual, and yet not so frequently that the process will become too 
much of a chore. Perhaps two or three times per year is often enough under 
ordinary circumstances. However, in individual cases it may be desirable 
to summarize and interpret more frequently. 

Summarizing and interpreting are usually best done by the teacher who 
has written the anecdotal record, though this may also be done by a com- 
mittee of two or three teachers, especially in difficult cases. The guidance 
teacher, counselor, or the school psychologist may be brought into the 
picture if needed. 

The anecdotal record and summaries should be passed on to successive 
teachers as the pupil progresses so that each teacher will have the benefit of 
previous observations and can add to them. By this means a quite com- 
plete and valuable “behavior journal” of a pupil may be built up over a 
period of years. 

It probably goes without saying that anecdotal records should always 
be related to all the other available information concerning a given child. 
Information on home conditions, health, ability, success in schoolwork, 
participation in extra-class activities, etc., should be considered along with 
the anecdotal records, and the whole taken into account in any interpreta- 
tions that are made. 

One factor which often causes difficulty in maintaining a system of anec- 
dotal records is the clerical work involved. Reference was made earlier 
to the time involved in recording even one anecdote per pupil per week. 
Where a schoolwide program involving hundreds of children is maintained, 
the total amount of time and labor can grow to large proportions. Never- 
theless, something of this nature has been tried and found feasible, at least 
in one school system.⁵ Six teachers in Grades 4 to 7 recorded anecdotes 
for three months on every pupil. During this time an extensive testing 


5 Arthur E. Hamalainen, An Appraisal of Anecdotal Records, Contributions to Edu- 
cation, No. 891 (New York: Teachers College, Columbia University, 1943). 
program was carried out with the same pupils. At the end of the three 
months the anecdotal records were compared with test results to determine 
how well the two types of data agreed. Although the findings are too exten- 
sive to cite in detail here, it was found that teachers could keep such records 
without much difficulty and that they were able to judge social relations of 
pupils accurately by comparing anecdotal records with test results. Among 
the most significant conclusions were: (a) The success of anecdotal records 
depends in large measure on the outlook of the teachers. Those having 
a formal, academic viewpoint will probably find little use for anecdotal 
records, and those they write will be of little value. (b) Unless the child 
has opportunity for many varied experiences it is probable that little useful 
information about that child will be found in anecdotal records. (c) Classes 
enrolling from 17 to 28 show no appreciable difference in the number of 
anecdotes recorded. 

The results of this study are encouraging. They suggest that keeping 
anecdotal records, at least in the elementary grades, is not an inhuman 
task, that teachers who are interested in studying problems of pupil adjust- 
ment can do it, and that the results seem to bear significant relationships 
to other measures of personality and adjustment. 


Learning Exercises 


5. What are the purposes of anecdotal records? 

6. What are the advantages and disadvantages of anecdotal records in comparison 
with other methods of personality appraisal? 

7. Write out five or six anecdotes about a pupil who is shy and withdrawn. Do the 
same for one who is overly aggressive. Write your interpretations of each set. 


SOCIOMETRIC METHODS 


The last of the observational methods to be discussed are those known as 
sociometric, sometimes referred to as interpersonal, methods. The instrument 
usually associated with these is called a sociogram. It is a pictorial or 
graphic representation of relationships of a specified nature among mem- 
bers of a group, and is based on information gathered from members of the 
group. Thus it differs from other observational methods discussed in this 
chapter in that the data are collected about individuals from their peers, 
rather than from teachers or other observers. 

The use of sociometric techniques probably dates from the now famous 
work of Jacob L. Moreno, first published in 1934.⁶ Much study has been 

6 Jacob L. Moreno, Who Shall Survive? (Washington, D.C.: Nervous and Mental 
Disease Publishing Company, 1934). 
made of the sociogram during the last twenty years, and it has been widely 
used in schools. The sociogram has proved valuable when properly used, 
but it has definite limitations and some dangers in the hands of those not 
adequately prepared. 

A sociogram is generally based upon the written answers to a question 
put to members of a group. For example, a fifth-grade group might be 
asked, “What boy or girl would you most like to have work on a committee 
with you?” Or, “What boy or girl would you want to sit with this semester?” 
This type of inquiry may be expanded to include a first, second, and 
even a third choice. The question can also be put in a way that requires 
the individual to react to every member of the group in a form of rating, 
as in a social distance scale. In this form each member of a group places 
every other member in one of several categories as, “Like very much,” or 
“Okay,” or “Don’t want to be with him any time.” However, the data 
obtained by use of the latter type of scale do not lend themselves as readily 
to organization into a sociogram as those obtained by the first type of 
question. 

Once the data have been collected, the next step is to tabulate them in a 
form that is useful. There are two methods commonly employed; one is 
to put them in a table, and the other is to construct a sociogram. Either 
method or both may be used with a given set of data. An illustration will 
help to make this clear. Suppose a fifth-grade teacher has asked 17 pupils 
to nominate their first and second choices of members of the class to work 
with on a certain project. The responses of the pupils may be tabulated as 
shown below: 


Chooser    First Choice    Second Choice 
Ken        Guy             Fran 
Jack       Kevin           Len 
Helen      Fran            Kathy 
Ted        Mike            Kevin 
Fred       Ken             Karl 
Fran       Jane            Milly 
Jean       Kathy           Kevin 
Sally      Milly           Jane 
Jessie     Helen           Fran 
Karl       Fran            Ken 
Mike       Jean            Kevin 
Len        Fran            Ken 
Guy        Karl            Ken 
Milly      Fran            Jane 
Kathy      Helen           Jean 
Jane       Sally           Fran 
Kevin      Ted             Jean 
These choices may be organized into a table somewhat like the following: 


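                                   Pupil chosen 
Chooser       Ken Jac Hel Ted Fre Fra Jea Sal Jes Kar Mik Len Guy Mil Kat Jan Kev 
Ken           .   .   .   .   .   2   .   .   .   .   .   .   1*  .   .   .   . 
Jack          .   .   .   .   .   .   .   .   .   .   .   2   .   .   .   .   1 
Helen         .   .   .   .   .   1   .   .   .   .   .   .   .   .   2*  .   . 
Ted           .   .   .   .   .   .   .   .   .   .   1   .   .   .   .   .   2* 
Fred          1   .   .   .   .   .   .   .   .   2   .   .   .   .   .   .   . 
Fran          .   .   .   .   .   .   .   .   .   .   .   .   .   2*  .   1*  . 
Jean          .   .   .   .   .   .   .   .   .   .   .   .   .   .   1*  .   2* 
Sally         .   .   .   .   .   .   .   .   .   .   .   .   .   1   .   2*  . 
Jessie        .   .   1   .   .   2   .   .   .   .   .   .   .   .   .   .   . 
Karl          2   .   .   .   .   1   .   .   .   .   .   .   .   .   .   .   . 
Mike          .   .   .   .   .   .   1   .   .   .   .   .   .   .   .   .   2 
Len           2   .   .   .   .   1   .   .   .   .   .   .   .   .   .   .   . 
Guy           2*  .   .   .   .   .   .   .   .   1   .   .   .   .   .   .   . 
Milly         .   .   .   .   .   1*  .   .   .   .   .   .   .   .   .   2   . 
Kathy         .   .   1*  .   .   .   2*  .   .   .   .   .   .   .   .   .   . 
Jane          .   .   .   .   .   2*  .   1*  .   .   .   .   .   .   .   .   . 
Kevin         .   .   .   1*  .   .   2*  .   .   .   .   .   .   .   .   .   . 

First choice  1   .   2   1   .   4   1   1   .   1   1   .   1   1   1   1   1 
Second choice 3   .   .   .   .   3   2   .   .   1   .   1   .   1   1   2   3 

Column headings give the first three letters of each pupil's name, in the 
same order as the rows. A dot marks an empty cell. 
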


Here each pupil's first choice is indicated by a 1 under the name of the 
pupil chosen, and the second choice by a 2. 

The number of times each pupil is chosen as a first choice and as a second 
choice is shown at the bottom. Those not chosen at all have no numbers in 
these rows. Mutual choices are indicated by an asterisk. For example, 
Guy is Ken’s first choice, and Ken is Guy’s second choice. 
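
The tallies and mutual choices described above can also be worked out mechanically. The sketch below copies the seventeen pupils' choices from the tabulation given earlier; the function name mutual_pairs is an assumption made for illustration, and a choice is counted as mutual whenever each of two pupils names the other, whether as a first or a second choice.

```python
from collections import Counter

# (first choice, second choice) for each chooser, copied from the tabulation.
choices = {
    "Ken": ("Guy", "Fran"), "Jack": ("Kevin", "Len"),
    "Helen": ("Fran", "Kathy"), "Ted": ("Mike", "Kevin"),
    "Fred": ("Ken", "Karl"), "Fran": ("Jane", "Milly"),
    "Jean": ("Kathy", "Kevin"), "Sally": ("Milly", "Jane"),
    "Jessie": ("Helen", "Fran"), "Karl": ("Fran", "Ken"),
    "Mike": ("Jean", "Kevin"), "Len": ("Fran", "Ken"),
    "Guy": ("Karl", "Ken"), "Milly": ("Fran", "Jane"),
    "Kathy": ("Helen", "Jean"), "Jane": ("Sally", "Fran"),
    "Kevin": ("Ted", "Jean"),
}

# Number of times each pupil is named as a first and as a second choice.
first_counts = Counter(first for first, _ in choices.values())
second_counts = Counter(second for _, second in choices.values())


def mutual_pairs(choices):
    """Return the pairs of pupils who chose each other, in sorted order."""
    pairs = set()
    for chooser, chosen in choices.items():
        for other in chosen:
            if chooser in choices.get(other, ()):
                pairs.add(tuple(sorted((chooser, other))))
    return sorted(pairs)


for pupil in choices:
    print(f"{pupil:<7} first: {first_counts[pupil]}  second: {second_counts[pupil]}")
print(mutual_pairs(choices))
```

A pupil who shows zero in both columns of the printout has not been chosen by anyone.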

A better way of showing the relationships in this group is with the socio- 
gram shown in Figure 10, page 314. 

There are several methods of constructing such a chart, but to discuss 
these in detail is beyond the scope of this book. One of the easiest and most 
practical methods is described in a publication dealing explicitly with this 
matter,⁷ and any teacher or counselor can learn the method with a little 
study and practice. 

So far, we have dealt with methods of collecting sociometric data and 
ways of tabulating and summarizing them. Perhaps a more important 
question concerns the purposes for which the results may be used, and a 


7 Horace Mann-Lincoln Institute of School Experimentation, How to Construct a 
Sociogram (New York: Teachers College, Columbia University, 1947). 
Figure 10 

Sociogram Showing First and Second Choices 
of 17 Fifth-Grade Pupils 

Legend: 
  ------->  FIRST CHOICE 
  - - - ->  SECOND CHOICE 
  <------>  MUTUAL FIRST CHOICE 
  <- - - >  MUTUAL SECOND CHOICE 


further question concerns the advantages and disadvantages of a sociogram. 
Brief answers to these questions follow. 

It may be said that a sociogram is probably the best instrument yet de- 
vised to reveal the social structure of a group. It shows interrelationships 
among individuals and relationships of each individual to the entire group. 
It provides a teacher or group leader with information that will help him 
to understand the behavior of the group and to function more effectively 
in working with that group. There are many relationships and sub-groups 
within any class or group which are not apparent on the surface. 

It is important that appropriate action be taken soon after the sociogram 
has been completed and examined.⁸ If the teacher has asked pupils to tell 
with whom they would like to work, groupings should be formed on the basis 


8 Helen H. Jennings, Sociometry in Group Relations (Washington, D.C.: American 
Council on Education, 1948). 
of what the children have asked for, as far as that is possible. The effect on 
pupils of carrying through is very salutary. It goes without saying that not 
to do so has the opposite effect, and that pupils will lose interest in socio- 
grams if they come to believe that nothing happens as a result of their ex- 
pressions of preference. 

The booklet mentioned earlier⁹ suggests the following uses of a sociogram: 


1. Identifying mutual choices, stars, isolates, chains, islands, and triangles 
or circles. 

2. Studying race or nationality in relation to group structure. In this case 
racial groups may be coded by use of different shaped figures (as was done 
for sex in Figure 10). 

3. Studying age or maturity in relation to group structure. 

4. Studying the relation of total group structure to out-of-school groupings, 
as scouts, sororities, etc. 

5. Studying the effect of certain experiences. In this case there should be a 
sociogram “before and after.” Thus it may be used to study the effect of various 
methods of choosing committees on the structure of the group. 
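
The first of these uses can be illustrated with the choices of the seventeen fifth-grade pupils tabulated earlier. The sketch below rests on two working definitions which are assumptions, since the terms are not defined in this chapter: a "star" is taken to be the pupil (or pupils) receiving the largest number of choices, and an "isolate" is a pupil chosen by no one.

```python
from collections import Counter

# Chooser, first choice, second choice: one pupil per line.
TABULATION = """\
Ken Guy Fran
Jack Kevin Len
Helen Fran Kathy
Ted Mike Kevin
Fred Ken Karl
Fran Jane Milly
Jean Kathy Kevin
Sally Milly Jane
Jessie Helen Fran
Karl Fran Ken
Mike Jean Kevin
Len Fran Ken
Guy Karl Ken
Milly Fran Jane
Kathy Helen Jean
Jane Sally Fran
Kevin Ted Jean
"""

choices = {}
for line in TABULATION.splitlines():
    chooser, first, second = line.split()
    choices[chooser] = (first, second)

# Total choices received, counting first and second choices alike.
received = Counter(name for pair in choices.values() for name in pair)

most = max(received.values())
stars = [pupil for pupil in choices if received[pupil] == most]
isolates = [pupil for pupil in choices if received[pupil] == 0]

print("Stars:", stars)        # Fran, with seven choices received
print("Isolates:", isolates)  # Jack, Fred, and Jessie
```

Chains, islands, and triangles depend on the pattern of choices rather than on counts alone, and are more easily found by inspecting the sociogram itself.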


In the same publication some limitations of sociograms are mentioned. 
First, it is pointed out that sociograms are only as valid as the rapport be- 
tween teacher and pupils will permit. Pupils must sign responses if the 
results are to be most useful, and if there is resistance to doing this or to 
answering the questions the responses are not likely to be worth much. 

Second, it is pointed out that since group structure, especially among 
groups of younger children, is quite fluid, the reliability of a single sociogram 
may not be very high. 

Third, the way in which the data are gathered may force responses which 
are misleading. For example, the sociogram does not reveal differences 
between strong and weak feelings, or even hostility. The point is made 
that to require three choices may force the nomination of someone for whom 
there is really no feeling of attraction and for whom there is perhaps even 
a feeling of dislike. 

Fourth, it is important to remember that a sociogram merely reveals 
conditions; it does not give answers or solutions. A teacher may decide 
that acceptance of an isolate by other members of a group must be brought 
about by authority, if necessary. Measures taken to accomplish this, 
even though subtle, may result in stronger feelings of rejection instead of 
larger acceptance. The solution or amelioration of conditions revealed by 
a sociogram depends upon the use of other techniques such as anecdotal 
records, interviews with individual children, and further careful study of 


9 Horace Mann-Lincoln Institute of School Experimentation, op. cit. 
the total situation. Perhaps the status of sociograms is best summed up by 
this quotation from the same bulletin: 


Once a sociogram has been plotted, it is a beginning not an end. It raises 
questions rather than answers them. Perhaps its greatest value is that it directs 
the attention to certain aspects of group structure which will lead to further 
observation of individual and group behavior. To date, we have few, if any, 
generalizations which can be applied in the interpretation of sociograms, al- 
though we are beginning to find certain tentative hypotheses. We are in great 
need of carefully reported anecdotes of group behavior recorded by teachers 
who are sensitive to problems of group behavior. If the making of sociograms 
encourages such observation and recording, they shall have fulfilled an im- 
portant function. 


Learning Exercises 


8. Using the following data, construct a sociogram: 


Chooser    First Choice    Second Choice 
Jerry      Jim             Harry 
Bill       Bob             Sam 
Carl       Jack            Jerry 
Jim        Harry           Sam 
Jack       Bob             Sam 
Ed         Sam             Harry 
Sam        Bill            David 
Harry      Jim             Ed 
David      Bill            Sam 
Frank      Bill            Jack 
Bob        Bill            David 
Tom        Frank           Bob 


Suggestions: Start with Sam and Bill and place around them those pupils who chose 
them, and then work in the rest. Try to construct a graph which has straight, 
right-angle lines, and no lines crossing each other. Use a solid line for first choices 
and a broken line for second choices. 
9. Can you identify any stars, isolates, chains, or cliques in your chart? 
10. If you were dividing this group into four subgroups or committees of three 
each, how would you proceed? Give reasons for your groupings. 
The Measurement Program 


WHAT IS A MEASUREMENT PROGRAM? 


The term measurement program as used here refers to any systematic use 
of tests or other non-test devices, often at regular or planned intervals and 
under competent direction, to solve some educational problem, or to help 
further the purposes of the school. Put in another way, a measurement 
program is undertaken whenever teachers, counselors, or other school per- 
sonnel use measurement in a systematic effort to facilitate the attainment 
of educational objectives. 

In large school systems the measurement program is usually directed 
and carried out by a central office, probably under the director of research. 
The program generally involves the use of standardized instruments of 
various types at specific points or grade levels, and such measurement occurs 
regularly every year. A large number of tests and measuring instruments 
of different types are used for a variety of purposes. 

However, if a fifth-grade teacher informs herself about measurement, 
gives some diagnostic tests in arithmetic to her pupils, follows the tests 
with remedial work based on the testing, and then gives another form of the 
same test to the pupils to see what improvement has taken place and what 
still needs attention, this may also be regarded as a measurement program. 


Measurement Includes a Broad Variety of Instruments 


As indicated in Chapter 1, the term measurement as used in this book in- 
cludes tests, rating scales, check lists, and instruments such as sociograms, 
anecdotal records, and the like. However, in discussing the broad principles 
of a measurement program it would become tiresome to refer to all such de- 
vices constantly. Since it is recognized that tests are the tools which are 
used most often and which form the groundwork of most measurement pro- 
grams, we shall, for the sake of simplicity and brevity, concentrate our dis- 
cussion on the use of tests in a measurement program. However, we shall 
refer to other types of instruments where this seems appropriate. It is basic 
to the point of view of this book that the instruments used in a measurement 
program should always be chosen or developed in the light of the purposes 
for which they are to be used. 


The Measurement Program is a Cooperative Enterprise 


Without the support and cooperation of teachers and counselors, the 
results of a measurement program can scarcely be utilized to the fullest 
extent. When some action is to be taken as a result of the program, whether 
it be grouping, counseling, remedial work, or any of a dozen possibilities, 
teachers and counselors may defeat the very purposes for which the testing 
was done by not cooperating in the program. If the program has been 
“dictated” by authorities rather than carried out with the advice and 
cooperation of teachers and counselors, it is possible that those involved 
will not respond wholeheartedly. Unfortunately, administrators, though fully 
aware of this fact, do 
not always take the trouble and time to secure the support of their staffs. 
The result is that the programs sometimes fall far short of attaining their 
maximum usefulness, or fail entirely. 

On the other hand, one must recognize the fact that it is not always easy 
to stimulate the active cooperation and interest of teachers in the systematic 
use of measuring instruments. Some teachers resent the interruption 
it may cause in their usual routine, some do not appreciate the extra demands 
on their time and energy, and a few are prejudiced against tests of any kind. 
They do not like the idea of having their pupils examined by any means 
other than those which they themselves have devised. Where such attitudes 
exist, they must be changed before a measurement program can be carried 
on with any reasonable assurance of cooperation and success. It may take 
some time to accomplish this, yet there are a number of ways of creating 
more favorable attitudes: selected teachers can be sent to summer school to 
take courses in measurement, professional libraries can be built up, and 
teachers can be urged to participate in workshops and institutes dealing with 
problems of measurement and evaluation. In-service training programs 
may also focus attention on measurement as a means of facilitating curricu- 
lum revision and improving instruction. 


Responsibility for the Measurement Program 


A program of any consequence is always undertaken with the coopera- 
tion and responsibility of more than one person. The program may involve 
only the classroom teacher and her supervisor or principal, or the work may 
be planned and carried out with the cooperation of the entire staff of a 
school or school system. In the latter case it is customary to entrust most 
of the actual direction to one qualified person or to a representative 
committee. As we have said, it is almost axiomatic that unless a measurement 
program has the active cooperation and support of all concerned it cannot 
achieve its maximum usefulness. When a measurement program is carried 
out to meet needs or to solve the problems which the teachers themselves 
regard as important, and when those teachers participate actively in plan- 
ning and carrying out the program, it will have a good chance of succeeding. 
It is also helpful to have parents understand the reasons for measurement 
so that they too will support the program. If parents can see that the re- 
sults of measurement help to bring about better learning and adjustment 
on the part of their children, their confidence in the usefulness of measure- 
ment and their faith in the school will be increased. 


PLANNING A MEASUREMENT PROGRAM 


Purposes 


A measurement program will be successful to the extent that it accomplishes
the purposes for which it is designed and carried out. Therefore, it
must be planned in accordance with those purposes. This is a matter for 
cooperative endeavor by all concerned. While many teachers are not well 
acquainted with standardized tests and techniques of measurement and 
appraisal, most will know what the educational problems are and they will 
know of many situations in which measurement may be helpful. School 
psychologists, counselors, directors of research, and other personnel with 
more specialized training can usually supply the leadership, the technical 
knowledge, and the skills needed for setting up a measurement program. 

Whereas in a smaller school or community the planning of a measurement 
program may be undertaken by the entire staff, such a procedure will generally
be too cumbersome or unwieldy in a larger system. In the latter
case it is generally better to have a committee made up of representatives of 
various groups, grade levels, schools, or districts to assume responsibility 
for planning and carrying out the program. This is not to say that the 
entire staff loses contact with the program. On the contrary, general 
teachers’ meetings from time to time may be devoted to over-all planning, 
progress reports, discussions, and implementation of results. Furthermore, 
occasional reports to the community may be used as a means of improving 
relations between the schools and the parents. 

Below are listed some of the major purposes for which a measurement 
program may be carried on. A more extensive discussion of each of these 
purposes, together with practical suggestions on using test results, will be 
found in the next chapter. 

PURPOSES OF MEASUREMENT PROGRAMS

Classification of pupils
Homogeneous grouping
Diagnosis and remedial work
Counseling and guidance
Marking
Motivation
Identification and study of exceptional children
Interpreting schools to the community
Improvement of school staff
Educational research


The above list is based on various studies and reports of the use of meas- 
urement in schools, and, while not exhaustive, it probably includes most of 
the common purposes for which educational measurement is used. 


Time of Year for Testing 

When the purposes of the measurement program have been decided upon, 
several other considerations immediately come to the fore. One of these is 
the time of year for giving the tests. Often this matter resolves itself into 
a choice between giving the tests at the beginning of the school year or near 
the end. The decision on timing will usually depend largely upon the purposes
for which the tests are intended. For example, diagnostic testing and
testing for purposes of grouping or grade placement will most profitably 
come early in the school year, while testing for purposes of promotion, edu- 
cational counseling, marking, and comparison of achievement with norms 
will usually occur near the end of the term or year. On the other hand, 
some of the purposes for which measurement programs are carried on are 
unrelated to the time of year. (See above list.) 


Frequency and Grade Levels of Testing 

Questions which must be decided upon early are the frequency of the 
testing and the grade levels at which particular tests are to be given. In 
part, these matters are determined by the purposes for which the testing is 
intended. For example, if tests are to be given for the purpose of voca- 
tional counseling there will generally be less emphasis on and less need for 
testing below the secondary level. On the other hand, diagnostic testing in 
arithmetic or reading will almost certainly be started in the earlier grades of 
the elementary school. 

The frequency of testing also depends on the purposes, but it is further 
affected by such considerations as the kinds of measuring instruments being 
used and the amount of money and time available for the work required. 
If devices such as rating scales or sociograms are a part of the program, it
may be desirable to use them more often than one would ordinarily use
standardized tests of intelligence or achievement. Many a measurement 
program, undertaken with enthusiasm and high hopes, has failed because 
those responsible greatly underestimated the expense and the involved 
operations necessary to carry it through. The initial cost of standardized 
tests is frequently the smallest item of expense; getting the tests properly 
administered, scored, and interpreted requires much time and effort. It is 
far better to undertake a modest program and complete the work required 
to put the results to effective use than to try to carry on a more extensive 
and ambitious program, only to have it bog down. 


A Minimum School- or Community-Wide Measurement Program 


No program can be prescribed which will fit every situation. Nevertheless, 
some suggestions will be made to help the prospective teacher, counselor, or 
administrator set up a kind of priority list for the planning of a testing pro- 
gram. 

If only one type of test is to be given, at least as a beginning, the first 
choice should almost certainly be a group intelligence test. If no standard- 
ized tests have been used before, it is desirable to give a group intelligence 
test to every pupil. Since the I.Q. based upon one group test is not completely
reliable, any cases which raise serious questions or present discrepancies
with other known facts about an individual should be tested as soon as
possible with another form of the same test. This point is important because
two forms of the same test will give results which are directly comparable,
whereas I.Q.'s obtained through the use of two different tests must be
equated by standard scores or similar derived scores before they can be 
directly compared. 
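
The equating step mentioned above can be illustrated with a brief sketch. It is only an illustration under stated assumptions: the norm means and standard deviations for the two tests, and the common scale (mean 100, standard deviation 15), are hypothetical figures chosen for the example and are not taken from any published test.

```python
def standard_score(score, norm_mean, norm_sd):
    """Return the standard (z) score of `score` relative to a test's norm group."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (score - norm_mean) / norm_sd


def to_common_scale(score, norm_mean, norm_sd, target_mean=100.0, target_sd=15.0):
    """Re-express a score on a common scale by way of its standard score."""
    return target_mean + target_sd * standard_score(score, norm_mean, norm_sd)


# Hypothetical norms: the same mean, but different spreads of I.Q.'s.
TEST_A = {"norm_mean": 100.0, "norm_sd": 16.0}
TEST_B = {"norm_mean": 100.0, "norm_sd": 12.0}

# The same numerical I.Q. obtained on each of the two tests.
iq_on_a = 124
iq_on_b = 124

z_a = standard_score(iq_on_a, **TEST_A)
z_b = standard_score(iq_on_b, **TEST_B)

common_a = to_common_scale(iq_on_a, **TEST_A)
common_b = to_common_scale(iq_on_b, **TEST_B)

print(f"Test A: z = {z_a:.2f}, common scale = {common_a:.1f}")
print(f"Test B: z = {z_b:.2f}, common scale = {common_b:.1f}")
```

Under these assumed norms the same numerical I.Q. stands at different distances above the average on the two tests, which is why results from different tests should be converted to derived scores before they are compared.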

The recommendation of a test of intelligence as the essential minimum is 
based on several considerations. In the first place, the I.Q. of a pupil cannot
be accurately determined without such a test. In the second place, for educational
purposes this information about a pupil is probably the most useful
and important that we can learn. The I.Q. gives more insight into his work,
achievement, and general mental ability than any other single fact about 
him can provide. 

If testing is to be done at regular intervals after the first year in which a
measurement program is started, it would be advisable to give intelligence 
tests either in kindergarten or early in the first grade; at about the fourth 
grade; again in the sixth grade if there is a junior high school, or eighth grade 
if the system is organized on the 8-4 plan; and again in the tenth grade. 
The results of these measurements should always be made a part of the 
cumulative record which accompanies the pupil as he progresses through the 
elementary and secondary grades. 

If more than one type of test can be given, and the results put to use,
a reading readiness test should be administered in the kindergarten or 
first grade, an achievement battery in the third and sixth or seventh grade, 
and an interest test or inventory in the ninth and twelfth grades. These 
tests will supplement the results of the intelligence tests at critical points 
in the pupil’s school career in ways which are most useful and appropriate 
at those points.

The use of other types of tests, such as diagnostic, personality, aptitude, 
reading, and those in specific school subjects, and evaluative instruments 
such as rating scales, check lists, and anecdotal records, should be under- 
taken where necessary, with a view to the available resources of the school 
and to the other factors peculiar to each situation. In every case, the 
purposes of the measurement program should be the dominating factor in 
determining its nature and extent. 

A tabulation of a recommended minimum school- or community-wide
annual testing program is given here:


Grade Level        Type of Test

K or I             Intelligence
K or I             Readiness
III or IV          Intelligence
III or IV          Achievement Battery (including reading)
VI or VIII ¹       Intelligence
VI or VII          Achievement Battery (including reading)
IX                 Interests
X                  Intelligence
XII                Interests

This program amounts to nine separate tests given annually at various 
levels throughout the usual grades of the elementary and high schools. 
Where two tests are recommended for use in either of two grades, it probably 
would be best, other things being equal, to distribute the burden by giving 
one test in each grade instead of giving both tests at the same grade level.
This is particularly desirable where the scoring is done by teachers, and it 
also requires less of the pupils’ time for such testing in any given grade. 
Such a program should not be undertaken lightly. It will require much 
time and work, although the results should be worth the effort many times 
over. The cost may safely be reckoned on an average of ten cents per pupil 
annually, provided that scoring the tests and tabulating and interpreting 
the results will not necessitate additional expenditures, and provided that 
the test blanks can be used over and over again with separate answer sheets. 

1 Depending on whether VI or VIII is the last grade in the elementary school.

Larger school systems often carry on testing programs far more extensive
than the one outlined. With a central staff organized for such work,
and ample financial support, a great deal more can be done. However, the 
above plan will provide a good foundation, and it represents the type of 
program that most schools with limited funds for measurement can carry. 

It will take three or four years under this plan for every pupil to be 
reached unless a group intelligence test is given to every pupil the first year. 
If a school-wide test is given the first year of the program, there will be
information available on every pupil as soon as the tests can be given and 
scored, and this will be a real advantage to teachers. The regular program 
may be launched the second year. Even if the school-wide intelligence test 
is not given the first year, there will still be test results available at four im- 
portant levels of the child's progress through the school. This will be a use- 
ful beginning and will help to introduce the program gradually so that those 
responsible will more easily be able to absorb the load. 


SELECTING AND OBTAINING THE TESTS 


Selecting the Tests 


Chapter 4 contains a discussion of the important criteria for judging the 
quality of tests. Here we need only point out that in planning a measure- 
ment program one should use the best available instruments for it. The 
criteria of reliability, validity, objectivity, ease of administration, scoring, and 
interpretation, availability of equivalent forms, adequate norms, and economy 
provide a sound basis for appraising any measuring instrument, although,
as we have pointed out, not all of these criteria necessarily apply to every
type of instrument.

In addition to the criteria mentioned above, a number of less tangible 
considerations usually influence the selection of instruments. In the case 
of standardized tests, for example, the deciding factor may be simply a 
general impression of the whole test. If those responsible for the selection 
like an instrument, if it seems to measure objectives which are important, 
and if they think it is suitable for their situation, that test or instrument will 
often be chosen in preference to one that meets the technical criteria more 
adequately. Probably the best that can be hoped for is an objective choice 
based on careful consideration of all available information about the test 
and the situation in which it is to be used. 

In a measurement program the task of selecting tests may be delegated 
to a committee. Sometimes the tests are selected by a member of the supervisory
or administrative staff and occasionally by the director of research
or by the counseling staff. However, if the program is to be of the type 
outlined above for all-around basic purposes, the tests should be selected by
a committee on which teaching, counseling, research, supervisory, and ad- 
ministrative personnel all have representation. 

This committee should have full responsibility and authority to obtain, 
examine, select, and purchase the tests in the quantities needed. It should 
also have the authority and the initiative to encourage local groups to 
develop measuring devices for local needs where commercially available 
instruments are not adequate. The committee should feel free to consult 
with experts and with any school personnel in making its choices, yet its 
decisions should be final and should be accepted as such by all concerned. 
That is, as long as all elements of the school staff have representation on the 
committee, and as long as the committee makes a thorough study of avail- 
able instruments in relation to the purposes of the program, its choices 
should not be subject to veto by administrative authorities or other groups 
except in the most extraordinary circumstances. If the committee re- 
sponsible for the over-all program is a large one, smaller sub-committees 
may be appointed to look after various phases of the program. The selection 
of tests might well be done by such a sub-committee. 


Obtaining the Tests 

It is common practice for test publishers to put up tests in packages of 
twenty-five or thirty-five. Each package contains the specified number of 
copies of the test, a manual of directions, a scoring key, a class record sheet, 
and any other materials necessary for proper use of the tests, except answer 
sheets. These are nearly always sold separately. Test publishers, as a 
rule, will not break packages of tests; in ordering, therefore, it is advisable 
to request a number which can be shipped in unbroken packages. For 
example, if tests are needed for one hundred and seventy pupils, one would 
ordinarily order one hundred and seventy-five (seven packages of twenty- 
five each or five packages of thirty-five each). The same is true of answer 
sheets. Prices are usually quoted for quantities of twenty-five or more. 
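
The arithmetic of ordering in unbroken packages can be sketched as follows. The function name and its error handling are illustrative assumptions; the package sizes and the figure of one hundred and seventy pupils are those used in the example above.

```python
def order_quantity(pupils, package_size):
    """Return (packages, copies) for the smallest order of unbroken packages
    that supplies at least one copy per pupil."""
    if pupils < 0:
        raise ValueError("pupils cannot be negative")
    if package_size <= 0:
        raise ValueError("package_size must be positive")
    # Integer ceiling division: round up to the next whole package.
    packages = (pupils + package_size - 1) // package_size
    return packages, packages * package_size


for size in (25, 35):
    packages, copies = order_quantity(170, size)
    print(f"Packages of {size}: order {packages} packages ({copies} copies)")
```

Either package size leads to an order of one hundred and seventy-five copies for this group, leaving five spare copies.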

Of course, the above applies only to paper-and-pencil tests. If the pro- 
gram involves the use of other types of materials such as sets of pictures, 
toys, nuts and bolts, phonograph records, etc., such equipment will usually 
have to be bought in single complete sets. Such material is not consumable, 
as are most tests, and the same instruments therefore may be used repeat- 
edly. 

Where measuring instruments or techniques are to be developed locally, 
it has usually been found most practical to have a committee assume the 
responsibility of leadership. The committee presumably knows the local 
situation and, with some technical assistance from a specialist in this field,
can usually produce a very acceptable instrument.


Handling the Tests Prior to Administration


The committee or person responsible should take charge of all measure- 
ment materials when they are received and keep them in a safe place until 
they are to be used. Obviously, if pupils have prior access to tests the re- 
sults will be invalidated. The assumption in the use of a standardized test 
is that everyone who takes it has an equal chance and that no one has an 
unfair advantage. Although classroom teachers are usually most scrupulous 
in such matters, their enthusiasm and eagerness to see pupils do well will 
sometimes lead them to give assistance which they should not give. 

The story is told about a high school principal and the teacher of mathe- 
matics who had agreed that a certain test should be given in the plane 
geometry classes. The tests were ordered, received, and turned over to the 
teacher for safe-keeping until the time set for the testing. A few days before 
the tests were to be given, the principal dropped in to the room to speak to 
the mathematics teacher, and he was surprised to find several problems from 
the test copied on the board, and the teacher discussing these problems 
with his pupils. After the class had been dismissed, the principal asked the 
teacher to explain. The teacher replied, “I was so anxious to see what they 
would do with the problems that I couldn't wait. I just had to try them 
out on a few.” While this teacher's enthusiasm and interest were highly
commendable, it is clear that his understanding of the purposes and use of 
standardized tests left much to be desired. 


Answer Sheets 


A facsimile of a portion of a standard answer sheet which may be scored 
by hand or by machine is reproduced in Figure 11, page 327. 

The use of printed standard answer sheets has developed to the point 
where almost every test publisher has similar forms available. These re- 
duce the expense of giving standardized tests since the tests themselves 
may be used over and over, and the cost of the test blanks may thus be 
spread over a period of several years. Standard answer sheets usually cost 
from two to five cents apiece. Since test publishers rely increasingly
upon the sale of answer sheets for revenue, it is only natural that they 
should do what they legitimately can to make necessary the use of their own 
answer sheets with their tests. Even so, there are many published tests 
with which standard printed answer sheets can be used, often at a slight 
saving. Also, it is sometimes possible to mimeograph answer sheets that 
can be used with a standardized test and thus save most of the cost of the 
printed ones. While this procedure may seem advantageous from the standpoint
of economy, it may prove to be “penny-wise and pound-foolish.”
For one thing, the scoring keys may not fit the homemade answer sheet, and
new ones might have to be prepared. In many ways the homemade sheets may be
less efficient than the published ones. What is more important, the use of a
locally prepared answer sheet may unfavorably affect the conditions of the
administration of the test.

[Figure 11. Portion of a Standard Answer Sheet]


Ethics

A word on the ethics of using standardized tests may be appropriate at 
this point. In the first place, it must be recognized that published stand- 
ardized tests are copyright material. No part of such tests may be copied, 
duplicated, or reproduced in any form without written permission from the 
holder of the copyright. To do so is to break the law. Aside from the 
legal aspects, there is a moral obligation which is equally great. Authors 
and publishers of standardized tests must expend large amounts of time, 
professional competence, and money in producing these tests. A single 
test may have involved the work of several persons for three to five years, 
as well as the cooperation of dozens of other people and the expenditure of 
thousands of dollars. Except for professional recognition accorded the 
authors and, to some extent, the publishers, the only recompense they re- 
ceive is a small profit from the sale of tests and answer sheets. Therefore, 
to reproduce such tests or accompanying materials or parts thereof without 
express permission is not only unlawful, but also unethical since it deprives 
of their rightful compensation those who have produced the tests. 

Another aspect of the ethics of using standardized tests has already been 
touched upon obliquely in the anecdote about the mathematics teacher 
and the principal, but it will bear amplification here. The continued use 
of standardized tests of intelligence and achievement requires that their 
nature and content be kept confidential until the tests are administered. 
Even then, such information should be divulged only under the conditions 
specified by those who produced the tests. This means that the material, 
questions, items, problems, and the content in general must not be used for 
any other purpose or in any way other than as a test. It is not permissible 
to discuss the nature of a standardized test or its contents with any pupil, 
parent, or other unauthorized person before the test is given, or to go over 
the test with such persons afterwards. A firm distinction should be made 
between a standardized test and teaching materials. If standardized tests 
are used for instructional purposes they soon lose value as measuring in- 
struments. Teachers often feel that it is useful to go over an achievement 
test with pupils after it has been scored in order to discover the pupils’ 
strengths and weaknesses. Strictly speaking, this should never be done 
with standardized tests. This type of activity can be carried on equally 
well, or even better, with locally made tests, and the risk of the misuse of 
standardized tests can thus be avoided entirely. 

The only possible exception to this principle is the standardized diagnostic 
test. With this type of instrument it may be permissible under certain 
circumstances to go over the test with the individual pupil to let him see his 
errors, but even here it is usually possible to accomplish the same objective 
in other ways. The manual of directions for such tests contains specific 
instructions as to how the results of a diagnostic test may be used most
effectively in remedial work. If the user adheres to these directions and 
suggestions he will usually be successful and ethical. 

If pupils are permitted to go over their diagnostic test papers for the 
purposes mentioned above, the teacher should always make sure beforehand 
that two or more equivalent forms of the test are available. Then, if re- 
testing is to be done, a different form can be used, thus minimizing the 
effect of familiarity with specific details of the form first used. What we 
have said about achievement tests applies with even greater force to tests of 
intelligence and personality, and to evaluative devices such as ratings, socio- 
grams, and anecdotal records. The results of these as well as the original 
instruments are to be held in strict confidence and should be accessible only 
to authorized school personnel. These principles of usage may seem strict, 
especially to the inexperienced student and user of standardized tests, but 
they are not unduly so. Psychologists and educators involved in the pro- 
duction and proper use of measuring instruments are genuinely concerned 
with this problem and have published a code of ethics which includes recommendations
for the proper use of psychological tests.²

Nothing said above should be interpreted to mean that teachers them- 
selves should not analyze results of standardized tests to identify strengths 
and weaknesses of pupils. On the contrary, this is one of the most important 
uses to which test results can be put. A teacher may thus determine what 
objectives or outcomes of instruction have been achieved to a satisfactory 
degree and by what pupils. Likewise, he can also identify weaknesses 
both of individual pupils and of the class as a whole, and proceed to remedy 
such weaknesses. More will be said about this matter in Chapter 14.


SCHEDULING THE TESTS 


Day of Week and Time of Day for Testing 

Considerations relating to the time of year for testing have already been 
mentioned. It is also necessary to decide when each test is to be given, par- 
ticularly if a school-wide or community-wide program is planned. This 
involves a decision on the day of the week and the time of day for testing. 
It is usually best to have the tests administered to all pupils at the same 
time, for such a plan causes less disruption of the school program and has 
the added advantage that pupils’ discussion of the tests will not work to the 
benefit of some and to the disadvantage of others. It also reduces or
eliminates entirely the period of worry and dread which some children
inevitably suffer if they know in advance that they are to take some tests.

2 American Psychological Association Committee, Ethical Standards for Psychologists
(Washington, D.C.: American Psychological Association, 1953), pp. 143-55.

With regard to the matter of pupil concern about forthcoming tests, it 
has been the writer’s experience that it is best to give standardized tests 
without any previous announcement to pupils. This eliminates all fear in
advance of the testing, as well as hasty “cramming” and pressures on
the teacher for hints on the nature of the tests. Nothing is gained by announcing
such tests in advance, and much suffering and annoyance may
result. 

It is probably best to give the tests in the morning because pupils are 
likely to feel more alert then than later in the day. There is a psychological
advantage in this, if not an actual, measurable difference in performance. 
It is also desirable, generally speaking, to give tests near the middle of the 
week. Monday is often “blue Monday,” and Fridays are likely to be 
crowded with activities of more compelling interest. While none of these 
factors may have a demonstrable effect on test performance, they may all 
have some effect on the state of mind of pupils, their attitudes towards
the program, and their concentration. A favorable attitude towards the 
testing is advantageous to the teacher and to the pupil. Ideally, the pupil 
should feel that he has been able to do his best on the test. 

Since absentees create additional problems it is well to give the tests 
when absences are likely to be at a minimum. If the tests require more 
than one sitting it may be necessary to spread the testing over a period of 
several days. Some achievement batteries require several hours of testing 
time and this may require several sittings, especially for younger pupils. 
If there are large numbers to be tested it is probably best to arrange a 
schedule of sessions to which every class or group will adhere. If a small 
number of pupils are involved — perhaps one or two rooms — the schedule 
can be arranged according to the convenience of those concerned. 

Ordinarily, it is desirable to plan a schedule for the administration of tests 
in a testing program. This increases the efficiency of the program and helps
to avoid the intrusion of personal preferences of the individual teachers. 
When the time for giving tests is left to the choice of the individual teacher, 
it is well to require that the testing be completed within a specified number 
of days. Otherwise, there may be delays and postponements which will 
hold up the entire program. 


Place of Testing 


The choice of the place for testing will depend on circumstances. With 
younger pupils it is generally best to administer the tests to the pupils in 
their usual surroundings as long as these provide proper conditions for test- 
ing. It is also better, if possible, to test very young pupils in small groups of 
15 or fewer. With older pupils location is probably not so important. Testing
pupils in their own home-rooms eliminates the problem of working out a 
room assignment schedule and considerably simplifies this aspect of the 
work. On the other hand, if large numbers are to be tested, it is more 
efficient to test as many pupils at one time as facilities will permit. It is 
obviously quicker to test 200 pupils at once than to test five groups of 40 
separately. The number that can be tested properly at one time is limited 
by the facilities available and by the organizational ability of those in charge. 
With the necessary facilities and assistance, 1,000 pupils can be tested just 
as easily as 100 or fewer. 

The place of testing should provide conditions and facilities necessary to 
the correct and most satisfactory administration of the test. In part, the 
choice of location will depend upon the factors of accessibility and
availability.


ADMINISTERING THE TESTS 


It is essential that standardized tests be given exactly according to directions.
These are sometimes long and complicated. No one should expect
to be able to pick up a manual just before the test is to begin, and, without 
previous experience and study, step into the job with complete assurance. 
Time and study are a necessary part of the preparation for giving a stand- 
ardized test. The inexperienced examiner should have a more experienced 
person help him review instructions and procedure beforehand. 

If a considerable amount of testing is to be done, the best plan may be to 
have all of the administration handled by only a few people. They can be 
given fairly intensive training by the person best fitted to do it — the 
school psychologist, research director, counselor, or other qualified person. 
This small group of specially trained teachers can then administer all the 
tests. Such a plan is almost sure to increase accuracy, uniformity, and 
efficiency. 

With very young pupils it is sometimes preferable to have the testing 
done by their own teacher, for children frequently will be more at ease and 
will respond better with someone they know and like than with a stranger. 
However, when teachers administer tests to their own pupils it is important 
that the testing be done objectively and that the instructions for administer- 
ing the tests be adhered to. 


Qualities of a Good Examiner 
Most teachers and counselors can learn to administer standardized tests 
successfully, yet a few seem constitutionally unfitted for the task. A list of
qualities which the successful administrator of standardized group tests
should possess would almost certainly include the following: 

a. Ability to understand and follow directions. The person who 
is to give a standardized test must have the ability to follow directions 
exactly. Sometimes these require the performance of complicated activities 
by the pupil and accurate timing by the examiner. Not every person likes 
to undertake involved procedures of this nature, and not everyone is quali- 
fied to perform these tasks and supply students with the needed guidance 
and assistance. The examiner must be willing to read and study the direc- 
tions until he understands them thoroughly. He should work through the 
entire test himself before he attempts to administer it so that he will be 
thoroughly familiar with every part of it. 

Once in a discussion among teachers about the uses of tests in the class- 
room, one member of the group, a middle-aged teacher, raised a problem: 
she stated that on a certain English test which she had used for a number 
of years her pupils invariably made scores well above the norms for the 
grade. This puzzled her, since the pupils were not otherwise unusual or 
exceptional, and she was unable to account for their very high attainment 
on this test. No one in the group was able to explain this, and the matter 
was dropped. The conference proceeded to other matters, but sometime 
later the same teacher brought up the problem again. This time, however, 
she happened to mention that when she administered the English test she 
ignored the time limits set in the directions and permitted her pupils to 
work on the test as long as they wished! This teacher seemed quite inno- 
cent of any notion of error on her part! 

Thorough study of the test and directions for its use, and careful adher- 
ence to instructions in every detail are the essentials for successful adminis- 
tration of a standardized test. Most teachers and school personnel can 
meet these requirements without undue difficulty. As teachers gain in 
experience and confidence, the administration of standardized tests becomes 
fairly easy and often stimulating and enjoyable. 

b. Ability to maintain the attention and whole-hearted coop- 
eration of a group. The administrator of a standardized test must be 
able to command the attention of a group and draw from each member his 
best efforts. If the test is a good one the tasks it involves and the instruc- 
tions to the pupils will help the examiner hold the pupils’ attention. Per- 
haps most important, the examiner himself must give an impression of 
serious attention and an attitude of regard for the importance of the task 
at hand. Many a well-meaning examiner, in his attempt to set pupils at 
ease, has spoiled the entire effort by such remarks as, “Don’t take this too 
seriously,” or, “It doesn’t mean anything.” Certainly children should not 
suffer unnecessary emotional strain in taking a test, yet a test is a test and 
if the child is to do as well as he possibly can he must be urged to give his 
undivided attention and cooperation. The examiner must avoid instilling 
in his pupils either an attitude of extreme emotional tension which may 
prevent a child from doing his best or an attitude of careless indifference 
which will defeat the very purpose of the test. 

c. Ability to read directions aloud clearly and distinctly. Read- 
ing aloud is something of a lost art among the younger people of America, 
for the emphasis in the teaching of reading has shifted almost completely 
to silent reading. The proper administration of a group test requires that 
the directions be read clearly and distinctly. This requires a good voice 
and the ability to use it effectively. It is necessary for every examiner, even 
the experienced one, to practice reading the directions aloud before giving a 
test for the first time. Through such rehearsal he will learn the proper in- 
flections, pronunciations, and phrasing. Sometimes the wrong intonation 
can change the meaning of a sentence and cause confusion or misunder- 
standing. 

By practicing reading the directions aloud the examiner can also gain 
sufficient familiarity with them so that he can make the reading more 
pleasant and meaningful to his audience. He should be able to take his eyes 
off the printed page occasionally, not only to see if all are paying attention 
and following him, but also for the good effect this will have on his audience. 
It may be necessary for the examiner to interrupt his reading occasionally 
in order to explain an example given in the directions or to make sure that 
all pupils understand what they are to do. A thorough mastery of the direc- 
tions helps to smooth such breaks and avoid awkward pauses. 

d. Ability to be objective. A teacher measuring her own pupils with a 
standardized test may find it very difficult to be objective because she is 
aware that the test results may often conflict obviously with her own judg- 
ment. She observes Johnny struggling unsuccessfully with a problem in the 
test and she feels that she must give him “just a tiny hint to help him 
solve it,” for she has seen him solve similar problems many times. The 
“tiny hint” may not even be expressed in words. As she looks at Johnny’s 
answer sheet over his shoulder, a frown or a surprised expression may be all 
that is needed to set him on the right track. This may make both Johnny 
and the teacher happy at the moment, but it destroys the objectivity of a 
good testing situation and it may later prove harmful to all concerned. 

If one is having the antifreeze solution in the radiator of his car tested 
when the temperature is below freezing and falling rapidly, he does not ask 
the service-station attendant to make the measurement sound better than 
it actually is in order to save the cost of a quart of the antifreeze. To do so 
might prove very expensive and even disastrous. In a situation of this sort 
we want as objective and accurate a report as possible so that we may know 
what to do to be safe. In the same way, we should seek objectivity and 
accuracy when using standardized tests. To say that the hydrometer the 
garage mechanic uses or the tests that educators use are not exact instru- 
ments merely beclouds the issue. We do not disregard the report on our 
antifreeze because the mechanic’s instrument is not as accurate as those of 
the scientist who is dealing with very minute quantities in theoretical 
physics, or because there is some variation in the grades of antifreeze manu- 
factured by different companies. We know that any instrument — me- 
chanical or educational — that is properly selected and used will give results 
which are reasonably accurate, or which are at least more dependable and 
accurate than subjective opinion or plain guessing. The point here is 
that educational tests should be used as objectively and carefully as pos- 
sible, with full realization of their limitations and with regard for the fact 
that using them carelessly or inaccurately may make the tests quite value- 
less or misleading. Of course, the person administering standardized tests 
or other measuring instruments should demonstrate warmth, understand- 
ing, and every attitude calculated to encourage pupils to enjoy taking tests 
and to do their very best; yet the examiner in his desire to see his pupils do 
well must do nothing that will invalidate the results of the testing, and he 
must not deviate from the test instructions. 


Physical Conditions of Testing 


The examiner should observe a few simple rules concerning the physical 
conditions of testing. First, the room should be comfortable. It should be 
well-lighted, well-ventilated, and well-heated. The seats should be com- 
fortable and of appropriate height. It is not uncommon to enter a room 
where testing is going on to find it crowded, the temperature too high or too 
low for comfort, all windows tightly closed, and the air almost unbearable. 
Frequently, the occupants of the room are entirely unaware of these con- 
ditions. Again, when pupils are shifted to a different room for testing, the 
desks in the new room may be totally unsuitable for them — too large or 
too small, for example. Such conditions can usually be avoided by calling 
them to the attention of responsible persons before the testing begins. 

It is also a good idea to place a sign on the door of a room being used for 
testing. This will keep out those whose business is not urgent and will re- 
duce the number of interruptions. 

The room should be large enough to permit spacing pupils in a way that 
the temptation to copy will be minimized. If the desks are movable they 
can usually be placed at a suitable distance from each other without much 
trouble. If there are too many pupils to be properly accommodated in a 
single room the teacher or examiner should try to divide the pupils so that 
the test can be given in two or more rooms. If the desks are fixed it is 
desirable to seat pupils at alternate desks and, if possible, in alternate rows. 
If tables and chairs are used pupils should be seated so that there is no 
temptation to compare papers. Such measures do not reflect on the honesty 
of pupils; rather, they are intended to reduce to a minimum the opportunity 
and temptation to cheat. 


Duties of the Examiner 


The person in charge of administering a standardized test is responsible 
for the proper conduct of the examination. He should have one or more 
proctors or assistants if the group is large, but the over-all responsibility is 
his. He needs a good stop watch or at least a watch with a sweep second 
hand. Many standardized tests require accurate timing and this is always 
the examiner’s responsibility. If he has a stop watch he should know 
exactly how to operate it beforehand. Stop watches of different makes and 
quality vary somewhat in technique of operation. After the testing has 
begun, it may be impossible to correct an error in timing resulting from 
faulty manipulation of the watch controls. 

If the examiner has only a watch with a second hand he should adhere to 
the following procedure: 

a. The examiner should synchronize the second hand and the minute 
hand so that both are together, that is, at the end of a minute at the same 
time. 

b. At the moment he finishes reading directions for a timed test and 
says “begin,” or “go,” he should glance at his watch, looking at the second 
hand first, and then the minute hand, immediately writing down the time, 
thus: 9-42-21, which means that the pupils began work on that part at 
forty-two minutes and twenty-one seconds past nine o'clock. 

c. Then, if the time allowance for that part of the test is two minutes 
and thirty seconds, he adds that to the starting time, thus: 


    9-42-21 
       2-30 
    ------- 
    9-44-51 


d. Forty-four minutes and fifty-one seconds past nine is the time when 
he should give instructions to stop, turn over the page, or whatever may be 
indicated. 
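
The addition in steps b through d can also be sketched in code. This is only an illustration and is not drawn from any test manual; the function names and the hour-minute-second format are assumptions made for the example, and the sketch does not handle a test that runs past twelve o'clock.

```python
def stopping_time(start, allowance):
    """Add a time allowance to a starting time.

    start     -- (hour, minute, second) noted when the examiner says "begin"
    allowance -- (minutes, seconds) allowed for this part of the test
    Returns the (hour, minute, second) at which to say "stop".
    """
    hour, minute, second = start
    add_min, add_sec = allowance
    # Work in seconds past the hour so that carries are handled in one place.
    total = minute * 60 + second + add_min * 60 + add_sec
    hour += total // 3600
    minute, second = divmod(total % 3600, 60)
    return (hour, minute, second)


def as_written(t):
    """Format a time the way the examiner writes it down, e.g. 9-42-21."""
    return "%d-%02d-%02d" % t


# The example from the text: start 9-42-21, allowance 2 min 30 sec.
print(as_written(stopping_time((9, 42, 21), (2, 30))))  # 9-44-51
```

The carry from seconds to minutes, and from minutes to the hour, is exactly the step most likely to go wrong in mental arithmetic, which is why the time should be written down and added on paper.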

No examiner should rely on his memory for time of beginning, or attempt 
to figure out by mental arithmetic the time of stopping. It is impossible for 
most persons to do this accurately. Furthermore, during the administra- 
tion of a standardized group test there are many things to do which are 
more useful and important than trying to calculate and remember times of 
starting and stopping. 

The examiner should be alert to everything that goes on in the room 
during testing. It is his responsibility to see that proper conditions for 
testing are maintained and to note and record any unusual happenings such 
as extreme nervousness, accidents, or illness. If the group is large these 
responsibilities are discharged with the help of his assistants or proctors, 
but the examiner is still the person ultimately responsible for the proper 
conduct of the test. 


Proctors 

Thirty or fewer pupils in a standardized testing situation can usually be 
handled by one person if the physical conditions, such as seating arrange- 
ments, are right. If the number being tested is between thirty and sixty, 
one proctor will be needed, from sixty to ninety, two, etc. The duties of 
proctors are to see that the pupils follow the directions given by the exam- 
iner, to see that pupils have pencils and other necessary equipment, to dis- 
tribute and collect test blanks, answer sheets, and other necessary materials 
at the proper time, and to help the examiner in every way to carry out his 
job as effectively as possible. 
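
The rule of thumb for the number of proctors can be expressed as a short calculation. The sketch below is illustrative only; the function name is hypothetical, and it assumes that a group of exactly sixty still needs only one proctor and a group of exactly ninety only two, a boundary the rule itself leaves open.

```python
def proctors_needed(pupils):
    """Proctors required in addition to the examiner.

    Assumes one adult per block of up to thirty pupils, the examiner
    being the first: 1-30 pupils -> 0 proctors, 31-60 -> 1, 61-90 -> 2.
    """
    if pupils <= 0:
        return 0
    return (pupils - 1) // 30


for size in (25, 45, 75):
    print(size, "pupils:", proctors_needed(size), "proctor(s)")
```

This count presumes that seating and other physical conditions are suitable; a crowded or awkward room may call for more help than the rule suggests.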

The examiner should assign his proctors or assistants to definite sections 
of the room or to specific parts of the group being tested, and each proctor 
should be held responsible for his section or part throughout the test. The 
proctor should stay with that portion of the group and be available to its 
members at all times for any legitimate assistance. Inexperienced proctors 
occasionally make mistakes which can easily be avoided. For example, 
they sometimes line up at the front of the room while the examiner is read- 
ing directions. They should avoid this, remaining as much as possible at 
their stations or in the background. The examiner should be the sole focus 
of attention when instructions are given or directions read. As far as pos- 
sible, nothing should divert the attention of the pupils from the task at 
hand. 

If the testing is not being done by the pupils’ own teacher, he should 
endeavor to be as inconspicuous as possible; he should either leave the 
room or sit quietly at the back while the testing is in progress. This may 
sometimes create a little awkwardness, but it can usually be handled 
through the principal of the school a day or two before testing begins. 
The presence of the teacher in the formal testing situation may have 
effects which are not conducive to the best efforts of the pupils. 

Finally, proctors should remain alert and interested in what is going on 
during the entire test. Sometimes they tend to feel that once the test has 
successfully gotten under way no further attention on their part is needed. 
That may be just the time when they should be most alert; a pupil breaks 
his pencil, his pen runs dry, he turns two pages at once, or something else 
happens which the alert proctor can remedy immediately. To be most use- 
ful, the good proctor should also be thoroughly familiar with the test and 
the details of its administration. Without this knowledge he cannot actively 
and properly assist in carrying out the administration of the test. 

In connection with the administration of standardized tests the question 
often comes up as to how far one may go in giving help. A good rule to fol- 
low is that no assistance of any kind should be given on the problems or tasks 
of the test proper. Also, it is generally considered good practice not to 
answer any questions regarding the test after work on it has actually begun. 
Though everyone is anxious for each pupil to do his best, the ideal standard- 
ized test situation requires uniformity of conditions for everyone being 
tested. Any act that gives one pupil more help or explanation than is given 
to all is not permissible. In some tests understanding and following direc- 
tions is part of the test, and in such cases no explanation of directions other 
than what is provided by the manual is allowed. The manual of a well- 
standardized test is usually quite explicit on what to say and read in giving 
the test, and it is not permissible to add to or depart from such instructions 
in any way. 


SCORING THE TESTS 


Who Shall Score the Tests? 

After the tests have been given, the “sixty-four dollar question” is how 
to get the scoring done. If arrangements can be made for having the tests 
scored mechanically there is usually no problem. In most situations, how- 
ever, hand-scoring is still the common procedure. The labor involved in 
scoring by hand has been greatly reduced by better arrangement of test 
items on the page, by the use of separate answer sheets with scoring stencils, 
and by the development of self-scoring techniques and other aids. However, 
we are still a long way from entirely relieving teachers of this job. If funds 
are available some of the work may be done by clerks, but of course this 
adds to the cost of testing. 

If teachers do the scoring it is very desirable to make some adjustment 
in their regular duties to give them adequate time for the work. Some of 
their classes may be dismissed or assigned temporarily to other teachers or 
substitutes, and they may be excused from some of their non-teaching 
duties for the time. These adjustments will not only help to get the scoring 
done quickly, but will also help to create favorable attitudes toward the 
testing program. Furthermore, there is the advantage that the scoring task 
will thus not become an extra burden on busy teachers. 


Suggestions for Efficient Scoring 


In hand-scoring standardized tests there are certain methods and prin- 
ciples which can contribute much to the speed and accuracy with which the 
work is accomplished. 

In the first place, it is well to divide the task so that each person may 
develop speed and accuracy on a particular part of the scoring. For 
example, if there are eight pages or parts of a test it is more efficient to 
have one individual score one page or part on all the tests than to have 
him score each test in its entirety. If there are enough workers to assign 
one page or part to each, an assembly-line procedure can be set up to good 
advantage. If not, then each scorer should concentrate on one page or 
part at a time. When he has scored one page or part on all the tests he 
should then take the next, and so on. By this method the scorer sometimes 
is able to quickly memorize the scoring key so he can dispense with it en- 
tirely. This also contributes to the speed and accuracy since the scorer can 
concentrate on the scoring without the necessity of manipulating scoring 
keys or checking the keys with answers written on the paper. Moreover, 
with this system the scorer does not need to constantly shift his attention 
from one part to an entirely different part of the test. He concentrates on 
one until it is finished. 

The task of arriving at part scores and adding them to get the total 
score should also be done by one person. The transmutation of raw scores 
into percentiles, or standard scores, should be the separate responsibility 
of another individual, or, if it seems convenient and efficient, the one who 
works out part scores and total scores can also do this, but again, it is prob- 
ably more efficient for him to do one part of the job at a time. 

The assembly-line procedure should be planned and carried out so that 
test blanks can move along the line smoothly without piling up at any one 
point. This requires that the work be assigned according to the difficulty 
of the separate tasks and the particular skill of each individual. A slow 
worker will necessarily be given a smaller or easier job; otherwise, he may 
delay others. The person in charge of the scoring must experiment with 
his helpers and the task at hand until he has an efficient and agreeable pro- 
cedure worked out. 

Second, it is generally better to do the work of scoring in a group than 
to let individuals take tests away with them to score at their convenience. 
When the scoring is done in a group it is possible to settle questions about 
procedure, allowable answers, etc., on the spot. Working in a group makes 
the scoring more interesting and stimulating for those participating. Also, it 
avoids the delays often caused by one person's holding a batch of tests 
upon which no other work can be done until he returns them. On the 
other hand, it is often difficult to find time or a suitable place for assembly- 
line scoring. However, it is often possible to find time when six or eight 
teachers can work together if, as has been suggested, the work is done 
during school hours. 

The place for the scoring of tests should be relatively quiet and free 
from interruptions, and the workers should be able to talk with each other 
freely without fear of disturbing others. If possible, there should be 
a large table or several smaller ones that can be put together so that all the 
scorers can work comfortably in a group. 

Third, all hand-scoring — the entire operation, from beginning to end — 
should be carefully checked for accuracy. Every step should be systemati- 
cally checked, and the person in charge should also check the accuracy of 
his individual workers. Wide variations will nearly always be found in the 
work of particular individuals; some will perform their tasks with few 
errors, others will seem unable to score accurately no matter how much 
they practice. 

Checking the scoring is best done by re-scoring a sampling of papers. It 
may be desirable at the beginning to re-score entirely the first few papers, 
perhaps five or ten, to see what mistakes are being made by individual 
scorers and to help correct these mistakes. After this, a sampling of every 
fifth paper or perhaps every tenth paper should be re-scored as a continuous 
check on the accuracy of the work. If it is found that errors are frequent 
and in all parts of the tests, it will often be necessary to re-score all papers. 
If errors are consistently found only in certain parts of the test or in certain 
phases of the work, it will suffice, as a rule, to re-score only those parts 
where error is found. 
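
The sampling plan just described can be laid out as a simple list of which papers to re-score. The sketch below is one possible reading of the plan, not a prescribed procedure; the function name and its parameters are hypothetical, and papers are assumed to be numbered from 1 in the order in which they were scored.

```python
def papers_to_rescore(n_papers, initial=5, interval=5):
    """Return the numbers of the papers to re-score as a check.

    initial  -- how many of the first papers are re-scored entirely
                (the text suggests five or ten)
    interval -- thereafter every `interval`-th paper is re-scored
                (the text suggests every fifth or every tenth)
    """
    first = list(range(1, min(initial, n_papers) + 1))
    # Continue the check at regular intervals after the initial batch.
    later = list(range(initial + interval, n_papers + 1, interval))
    return first + later


print(papers_to_rescore(30))
# [1, 2, 3, 4, 5, 10, 15, 20, 25, 30]
```

Such a list tells the checker only where to look. Whether to re-score everything, or only certain parts of the test, still depends on how frequent the errors are and where they are found.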

There is one basic fact often overlooked by users of standardized tests, 
namely, that errors occur in all such work, and it is therefore absolutely 
essential that continuous and systematic checking take place. The less 
experienced the workers, the greater the probability of error, but regardless 
of the experience of the scorers, the scoring should not be accepted as final 
or the results recorded until every step of the process has been checked. 
To permit inaccurate scoring is a waste of time and money, and it may 
cause grave injustice to the pupils. 

Finally, it is important to train workers in the scoring process. One does 
not usually put scorers to work until they have had some prior instruction. 
It is usually desirable first to have prospective scorers read the manual, 
or at least the part dealing with scoring, so that they will understand what 
they are to do and how to do it. Then the person in charge should help 
score enough papers so that there is no doubt that the work is being done 
correctly. In this way one may be reasonably sure that the correct pro- 
cedure is being followed. 


RECORDING AND ANALYZING RESULTS 


Recording Test Results 


After the tests have been scored and checked, the results must be made 
a part of the permanent records of the pupil and the school. Most schools 
have some sort of permanent record for each pupil. This may be in the 
form of a folder providing for the recording of information such as per- 
sonal and home background data, schools attended, marks, honors, disci- 
plinary or other special actions, and results of various tests. Many such 
record systems have been devised, some by national agencies, some on the 
state level, and some by the larger city school systems. In Figure 12, pages 
340 and 341, one such sample permanent record form is reproduced. 

Whatever the type of permanent record, it is important that there be 
one, that the test results be recorded as soon as possible after they are 
available, and that this record be readily accessible to those entitled to use 
it. The entering of test data upon the record form may be done by teachers 
for their own pupils, or by clerks if such help is available. These records 
are confidential and should not be available to pupils or other unauthorized 
persons. 

It is probably best to keep the permanent cumulative records of pupils 
in some central place such as the principal's office. In some schools records 
are distributed to teachers so that each will have in his custody the records 
of his own pupils. It may also be found desirable to have the records filed 
in the office of the counselor, if there is one. There are arguments for and 
against centralized and distributed record systems, but these need not be 
reviewed here. The important thing is that the records be used — not just 
filed — and it should be determined in each situation what arrangement is 
best for all concerned. 

In the case of most standardized tests a class record sheet accompanies 
each package of tests. This provides for the recording of names, ages, part 
scores, total scores, derived scores, class averages, and similar data. It is 
sometimes convenient to fill out the record sheet and give it to the teacher 
for his use, and to make the test results a part of the permanent record kept 
in a central place. But regardless of what record-keeping system is adopted, 
it must be remembered that the system will represent a great waste of time 
if the data recorded are not used. Sometimes one finds schools which keep 
remarkably complete records, yet these are taken out only for the regular 
and faithful recording of data. No one is ever observed using them except, 
possibly, to show them to an occasional visitor as an example of the fine 
records kept by the school. At the other extreme is the school that possesses 
elaborate record forms, but enters grades and other information sporadi- 
cally and unsystematically. Such records have little or no value, for even 
if someone were interested in them the files would contain little if any 
systematic information. 


Analyzing Test Results 


Before anything can be done as a follow-up of testing there must be an 
analysis of the results. The analysis may be very simple in nature, as 
when a pupil’s rank in the class is determined, or it may be more compli- 
cated, as in the case of large-scale testing programs involving hundreds of 
schools and thousands of pupils in a large city system. Whatever is done, 
the basic procedures and techniques are usually statistical. A brief survey 
of statistical techniques, especially as they apply to educational and 
psychological test results, has been given in Chapter 3 and Appendix A. 
By the application of these techniques a teacher or counselor can make all 
of the usual types of analysis without going too far into technicalities. For 
analysis of a more advanced nature one of the standard works in statistical 
methods should be consulted. 

It may be appropriate to say a word here on profiles and other graphic 
methods. It is sometimes erroneously assumed that when we make a 
profile, a percentile curve, or other graphic record of test results, we are 
making further statistical analysis of test results. It is possible that a 
graphic representation of data may clarify a statistical situation or even 
give new insight into its meaning, but a graph or chart simply expresses 
statistical data in another form. A profile is merely a graphical representa- 
tion of data which are already known. By extending or extrapolating a 
curve we may extend the data, but the results of such procedures are always 
hypothetical. Moreover, the same results may be determined statistically 
as well as graphically. 

The reason for emphasizing this matter is that users of tests and test 
results sometimes are led to believe that profiles, distribution curves, and 
other graphic methods add something which statistical data do not yield. 
Profiles and other graphic methods are often very helpful, but principally 
because they express findings in a way which is more readily grasped than 
mere statistics. They do not, as a rule, tell us anything about John or 
Mary which the numbers do not already say. 


ILLUSTRATIVE MEASUREMENT PROGRAMS 


To complete this discussion of measurement programs we shall offer a 
few illustrations representing actual practice in school systems of different 
sizes. All of these are programs which have been developed through local 
leadership and experience. They are practical, therefore, and their scope 
is such that they can be managed at the local level without undue expense 
or labor. Included are programs for (1) a small community school with no 
secondary grades, (2) a larger school including all grades from kindergarten 
through twelfth, and (3) a city system comprising a large number of ele- 
mentary schools plus junior and senior high schools. We shall not describe 
a measurement program for a very large city, for such programs are usually 
extremely complex and diverse. However, the basic principles and pur- 
poses will be the same, regardless of the size of the system. 


A Typical Program for the Small School 


The first measurement program to be described is one which any small 
elementary school enrolling 300 or fewer pupils can manage without out- 
side help. It follows closely the minimum program presented earlier in this 
chapter. It happens to be one which has been successfully administered 
for a number of years at the Stoner School near Lansing, Michigan. Some 
of the students in the writer's classes in educational measurement have 
participated in the program, have helped to administer and score the tests 
and analyze the results, and have had the opportunity to discuss with the 
teachers ways of putting the results to use. This has served to help the 
teachers and the school, and has provided important practical experience 
for a number of university students. 


Stoner School Testing Program 


Grade   Test                  Time 
K       Reading Readiness     May 
I       Intelligence          February 
III     Achievement Battery   October 
Vi      Intelligence          February 
VI      Achievement Battery   October 


This program is simple and it is limited enough to be managed by the 
regular staff of the school without outside assistance. The results are used 
in various ways, including the identification of exceptional children, both 
slow-learning and gifted, counseling with parents, sectioning or grouping, 
diagnosis, and comparisons with national norms. 

The program gives information of several types, all useful in improving 
instruction. It reveals the stage of development or readiness of beginners 
for learning the fundamentals, it gives several appraisals of each child's 
general capacity for school work, and it provides a reasonably adequate 
survey of each pupil's progress in the common branches of the elementary 
school curriculum. The entire cost of the program, including test material 
and some clerical assistance for recording the results, can be kept to less 
than one hundred dollars per year. This allows nothing, of course, for the 
time of the school staff or for any outside help. However, when teachers 
become educated to the value of the results of their work and feel that 
they have a real part in the program, they are usually glad to contribute a 
reasonable amount of time and energy to the work of carrying it through. 

3 Victor H. Noll and Marvin D. Glock, “Functional Courses in Measurement and 
Evaluation,” School and Society, 70:339-40 (November 26, 1949). 


A Typical Program for the Larger School 


The program to be described next is one that might fit a school system of 
almost any size, though its scope is such that it can be managed by a school 
of moderate size without a particularly large specialized staff. The program 
calls for one part-time coordinator at the elementary level and one part- 
time coordinator at the junior-senior high school level. Each of these 
coordinators may be a counselor or guidance person, or a teacher whose 
schedule of regular work has been adjusted for this purpose, and who has 
had some instruction in measurement. 


Measurement Program: Twelve Grades ⁴ 


Grade   Test                   Time 
K       Reading Readiness      May 
I       Intelligence           January 
II      Achievement Battery    October 
II      Intelligence           April 
VI      Achievement Battery    October 
VI      Intelligence           April 
VII     Reading                October 
VIII    Algebra Aptitude       April 
IX      Vocational Interests   October 
IX      Intelligence           April 
IX      Personality            (limited use as appropriate) 
XI      Reading                October 
XI      Vocational Interests   October 
XI      Intelligence           April 
XI      Personality            (limited use as appropriate) 
XI      Aptitude tests         (clerical, mechanical, etc., in individual cases) 


In this program, by way of summary, three tests which measure a pupil's 
general mental ability, plus a reading readiness test, are given in the ele- 
mentary grades; two surveys of achievement are also given. In the high 
school two more tests of intelligence, two measures of vocational interests, 
two reading tests, and an algebra aptitude test are given. In addition, the 
high school counselor uses a personality test at his discretion in individual 
cases. Some tests of specific aptitudes are also given by the counselor to 
individual pupils in the eleventh grade, as seems desirable. The entire 
cost of this program for 1,500 pupils, including all materials, supplies, and 
some clerical services, is about $600 per year, or forty cents per pupil. 

4 Courtesy of Okemos, Michigan, Public Schools. 

It will be seen that the schedule is so arranged that practically all of the 
testing comes in October and April. This has the advantage of causing 
less disruption of the regular schedule than if the testing were scattered 
throughout the year. It enables the staff and administration to plan reg- 
ularly for the testing and to concentrate on it at stated times. It may 
also have the advantage that pupils are not taking the same or similar tests 
at different times — a condition which sometimes gives those who take 
the tests later an advantage over those taking them first. 

The administration of this measurement program is in the hands of the 
Director of Elementary Education and the Director of Guidance. The 
tests are administered by these persons, or under their supervision. They 
are responsible for the scoring and analysis of results. Most of the routine 
scoring is done by clerks, some of whom are high school students who work 
under careful supervision during free school hours and after school. Where 
such help is used the test papers are coded so that the pupils’ papers cannot 
be identified. After the tests are scored the raw scores are converted to 
appropriate derived scores, and profiles are constructed by the teachers. 
The results are recorded in the pupils’ permanent record folders. The 
records for the first six grades are kept in the classroom for use by the 
teacher, and the high school pupils’ folders are kept in the guidance office. 

Results of this program are used in many ways, such as for the identifica- 
tion of retarded and gifted children for special instruction. The reading 
readiness testing usually reveals a number of pupils in the kindergarten 
who are not ready to begin first-grade work, and special provisions are 
made for such cases. The reading tests at the higher grade levels serve to 
identify pupils who have reading difficulties. The school maintains a 
reading improvement service with the part-time help of a reading specialist. 

The algebra aptitude test is used to determine whether or not pupils are 
ready to take algebra. If not, they are advised to take a course in general 
mathematics instead. When the general course is completed satisfactorily, 
and a pupil still wants to take algebra, he is then enrolled in that course. 

The intelligence and achievement tests are used in sectioning pupils ac- 
cording to capacity and past performance, wherever this is feasible and 
advisable. The interest inventory is used as a starting point for high 
school seniors in a class in occupations. Here, pupils examine their own 
interest profiles and make studies of occupations which the inventory sug- 
gests should be of particular interest. The studies may include visits to 
places of business, industrial plants, hospitals, schoolrooms, etc., according 
to the occupations being considered. 

The results of the measurement program are used to identify individuals 
who may need psychiatric or clinical services, they are used in educational 
and vocational guidance, and they are used to assist in bringing about better 
understanding and better relationships among school personnel, pupils, 
parents, and in the community in general. These results serve many partic- 
ular purposes, but the most important purpose is the improvement of the 
educational program and the consequent improvement in the functioning 
of the entire school system and its contribution to the community. It is 
believed by those responsible that the measurement program in this com- 
munity contributes substantially to these purposes. 


A Typical Program for a City System 

The third program to be described is one that has been developed by 
the Long Beach, California, school system enrolling approximately 65,000 
pupils in elementary schools and in the junior and senior high schools. 
Altogether, seventy schools are involved, of which fifty-one are elementary, 
Grades K-6; fourteen are junior high schools, Grades 7-9; and five are
senior high schools, Grades 10-12.

The basic program followed throughout the district is shown below:


LONG BEACH PUBLIC SCHOOLS 


ELEMENTARY SCHOOLS 


Grade       Test                      Week

* III-VI    Local Spelling Tests      3
  III       Primary Reading           8
  VI        Achievement Battery       9-10
  V         Intelligence              12-14
  III       Intelligence              20-21
  VI        Arithmetic                31-32
* III-VI    Local Spelling Tests      32
  III       Arithmetic                36-37


* Scores of Grades IV and VI are reported to the research 
office. Scores of Grades III and V are for building use only. 


JUNIOR HIGH SCHOOLS


Grades        Test                                 Week

   IX         Language Arts                        2
** VII-VIII   Local Spelling Tests                 3
   VII        Reading and Language Arts            5
   VII        Intelligence                         7-8
   IX         Reading, Language Arts, Spelling     21-28
   VII        Arithmetic                           31-32
** VII-VIII   Local Spelling Tests                 32
   VIII       Arithmetic                           33-34
   VII        Geography                            35-36


** The test scores of Grade VIII are reported to the research 
office. Scores of Grade VII are for building use only. 


SENIOR HIGH SCHOOLS


Grades   Test                              Week

X        Intelligence                      2-4
         Mechanics of Written English
         Reading Comprehension
XI       *** Arithmetic                    28-31
         Mechanics of Written English
         Reading Comprehension
         American History                  36-38
X        Mechanics of Written English      37-39


*** Arithmetic scores are for building use only. 


The tests in the basic program are administered to all pupils at the 
levels and at the approximate times indicated. Tests are usually admin- 
istered by counselors or by teachers, though sometimes in an elementary 
school they are administered by the principal. Those tests which require 
hand-scoring are scored by school personnel; where machine-scorable 
answer sheets are used the scoring is done in a central office, and the scores 
are returned to the school for tabulation of results. In general, tabulations 
of all test results are sent to the office of the Director of Research, but the 
scored tests are kept at the school. Test results are recorded in each pupil's 
cumulative record. 

The Long Beach Director of Research has developed a series of mimeo- 
graphed instruction sheets which supplement the manuals for the various 
tests, and these instruction sheets provide teachers with practical sug- 
gestions on administering and scoring tests, tabulating, analyzing, and 
recording results, and reporting them to the office of the Director. Such 
carefully worked-out instruction sheets are essential in a large program such 
as this where personal contact with all cooperating teachers is impossible. 

In addition to the district-wide program which begins with the third 
grade, each elementary school gives tests in the kindergarten and first two 
grades. These tests are selected by the personnel of the individual schools 
and vary considerably from one school to another. The most common 
procedure is to give a mental or a readiness test in Grade 1, and a reading 
test in Grade 2. To supplement the program of required or district-wide 
testing, each elementary school has an opportunity to select additional 
tests for optional use in Grades 3 to 6. 

In the junior and senior high schools, tests of interests, aptitudes, and 
personality, which are selected by the school counselors, are given in ad- 
dition to the tests included in the district-wide program. All senior high 
schools use a vocational interest test and an aptitude battery. Personality 
tests are used on an individual basis wherever counselors deem them useful 
and appropriate. Many of the junior and senior high schools also elect to 
give standardized achievement tests in such fields as foreign language, 
mathematics, and science. 

The program may be summarized as follows: 


ELEMENTARY GRADES

Readiness and Intelligence                 I, III, V
Reading                                    III
Arithmetic                                 III, VI
Achievement Battery (includes reading,
  arithmetic, language)                    VI
Spelling                                   All grades

JUNIOR HIGH SCHOOLS

Intelligence                               VII
Reading, Language                          VII, IX
Arithmetic                                 VII, VIII
Geography                                  VII
Spelling                                   All grades

SENIOR HIGH SCHOOLS

Intelligence                               X
English, Reading                           X, XI
American History                           XI
Arithmetic                                 XI


The bulk of the testing comes in the third, fifth, seventh, and tenth 
grades. Each is a crucial year, and the information derived from the 
measurement program thus becomes available at a time when it will gen- 
erally be most useful. In the elementary school there is information on 
the pupil's progress in reading, arithmetic, and spelling, and on his in- 
telligence, all of which are fundamental criteria for evaluation. A survey 
battery gives an appraisal of his status in other subjects as well. 

The junior high school program continues and builds on the program 
developed in the elementary grades. In the seventh grade pupils take tests 
of intelligence, reading, language, arithmetic, geography, and spelling. 
This gives junior high school teachers and counselors a fairly complete 
appraisal of their pupils early in the program. 
The district-wide testing program in senior high schools concentrates 
mainly on mental ability, English mechanics, reading, U.S. history, and 
arithmetic. These tests are supplemented (as are those in junior high 
schools) with tests of interests and aptitudes, with some achievement tests 
in fields not covered by the required program, and, to a limited extent, with 
tests of personality. By the time a pupil finishes high school he will have 
had at least five intelligence measures and a thorough appraisal of his 
achievement in reading, language, arithmetic, spelling, geography, and 
history. 

Such a program requires the leadership and assistance of a central co- 
ordinating agency; in the case of the Long Beach school system the 
supervisor of the program is the Director of Research, though the major 
share of the work is divided among the teachers and counselors so that no 
one will be overburdened. The results are reported to the central office, 
but the scored tests are kept in the respective schools where they may be 
interpreted and put to use. This program, moreover, is not an expensive 
one.⁵

The results of the Long Beach testing program are used in various ways, 
including the following: 

1. To assist the central office staff — particularly those in the Division of 
Instruction — in identifying curricular fields of strength and weakness for the 
total district. If the district-wide test results in a given field are not satis- 
factory, it means that special effort must be made to analyze the causes and 
then to correct the conditions which are believed to be responsible for the 
weakness in pupil achievement. 


2. To provide school administrators with one basis for evaluating pupil 
achievement in individual schools of the district. For example, each principal 
maintains a set of graphs on which is charted the tested achievement of pupils 
in his school for previous years and for the current year. By examining these 
graphs a newly appointed principal can see what the past record of his school 
has been on reading tests, for example, and he has a basis for judging whether 
the current year’s test scores are in line with those of previous years. 


3. To identify those pupils in the district who rank in the upper 4 per cent 
of general scholastic aptitude so that these "Very Superior" pupils can be 
given individual counseling and instruction adapted to their abilities. 


4. To identify pupils who should be re-tested individually in order to determine
whether they should be placed in special classes for the slower learners.


5. To furnish teachers and counselors with test data needed in student 
counseling. 


5 Dr. Anton Thompson, Director of Research in the Long Beach, California, school 
system, reports in a personal communication that the cost for tests and materials per 
pupil in average daily attendance in the elementary schools was fourteen cents in 1954-55; 
for high schools it was twenty-five cents per pupil. This does not include cost of services 
such as scoring, recording, and statistical analysis. 
6. To give teachers an objective basis for judging pupil progress in achievement.
For example, the senior high teacher of U.S. history can compare the 
achievement of his students with that of other pupils in the nation who were in 
the norm group. 


7. To supply teachers with certain instructional clues. Following some 
surveys, an analysis of the pupil's answers (an "item-count") in a sample of 
test papers is made. A report is then prepared so that teachers can learn 
which types of skills and understandings seem to have been satisfactorily 
mastered, and which types appear to be in need of strengthening. 


8. To maintain public confidence in the local schools. During a given week 
thousands of pupils tell their parents that they have just taken a standardized 
reading test. Many parents are impressed with the fact that the school system 
not only teaches reading (and the other basic skills), but also checks up on its 
teaching efforts. Another important use in public relations: the district-wide 
data can be used by school administrators to provide a factual answer to an 
occasional critic making an unfounded charge concerning the general level of 
achievement of local pupils. A citizen sometimes bases criticism of a school’s 
educational program upon extremely limited evidence, perhaps on his sub- 
jective judgment as to the school accomplishment of a few pupils or gradu- 
ates he has met. In such circumstances a skillful administrator can use the 
results of the district’s testing program to build and maintain public confidence. 
He might tactfully show the critical citizen (assuming the critic is a reasonable 
person) that according to the results of a recognized achievement test given 
to the total school population of many thousands in a grade the district median 
represents a satisfactory level of accomplishment. 
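The "item-count" mentioned in the seventh use above can be sketched in a few lines. This is only an illustration under stated assumptions: the answer key, the four sample papers, the helper name `item_count`, and the 50 per cent cutoff are all hypothetical and are not taken from the Long Beach program.

```python
from collections import Counter


def item_count(papers, key):
    """Return, for each item number, the fraction of sampled papers
    that answered the item correctly."""
    correct = Counter()
    for answers in papers:
        for item, (given, right) in enumerate(zip(answers, key), start=1):
            if given == right:
                correct[item] += 1
    n = len(papers)
    return {item: correct[item] / n for item in range(1, len(key) + 1)}


# Hypothetical five-item test and a sample of four pupils' papers.
key = ["b", "d", "a", "c", "a"]
papers = [
    ["b", "d", "a", "a", "a"],
    ["b", "d", "c", "a", "a"],
    ["b", "a", "a", "b", "a"],
    ["b", "d", "a", "c", "b"],
]

rates = item_count(papers, key)

# Items answered correctly by fewer than half the sample (assumed cutoff)
# are the ones a report might flag as needing strengthening.
needs_work = [item for item, rate in rates.items() if rate < 0.5]
```

With these made-up papers only the fourth item falls below the cutoff, which is the kind of clue such a report might pass on to teachers.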


Other testing programs in similar-sized communities could be described, 
and these would differ in such details as grade placement of tests, and areas 
tested. However, the three programs we have described have been found 
workable and useful in their respective communities. They are not extremely
elaborate, and they are modest in cost. They serve to illustrate the 
principles discussed in the first part of this chapter and should be sug- 
gestive of what can be done by other communities with similar needs and 
purposes. 

It is probably unnecessary to state that in any measurement program 
the teacher or supervisor should anticipate the necessity for re-testing in 
doubtful cases, and provision should be made for this on an individual 
basis. Accidents will happen to prevent pupils from finishing a test; there 
will be surprisingly low, or sometimes surprisingly high, scores, and 
of course there will always be absentees. Wherever a child misses the 
testing, or when there is good reason to believe that he has not been ade- 
quately or accurately measured, a make-up test is in order. Whenever an 
alternate form of the same test is available, it is preferable to use it in 
such cases, especially where a child has started a test, but has been unable 
to finish. It is usually more satisfactory to have him take the entire test 
in another form than to have him attempt to go on from the point where 
he stopped.

It should be understood that the discussion in this chapter has been 
concerned entirely with group tests. It is recognized that in some cases 
an individual examination like the Binet, the Wechsler, or other individual 
tests may be used. However, such examinations must be administered 
by a person with special training since most classroom teachers have not 
had the necessary instruction in the use of these instruments. Where such 
tests are needed and used it is assumed that qualified persons are available 
to administer them, and that the tests will be given as necessary. 
This assignment involves the application of much that has been presented in 
this chapter. It may be carried out as a class project in which different members 
assume responsibility for certain aspects of it, or as a term paper to be prepared 
by individuals. After completing the plan, apply the criteria discussed by Traxler⁶
and see how well they are met. 


PLANNING A TESTING PROGRAM 


Set up a plan for the organization and administration of a program of testing or 
evaluation in a school. Assume that the school includes all levels from kindergarten 
through twelfth grade. You have a limited amount of money to spend, perhaps $250 
the first year and $200 per year thereafter. The enrollment is approximately 550, 
with about 40 pupils in each grade from kindergarten through the twelfth. 

There has been no systematic program of this nature up to the present, and you 
may assume that you are “starting from scratch.” This is to be a continuing pro- 
gram, and you should plan what you would do the first year and annually thereafter, 
within the limitations of the funds available. 

Assume that the work will be done by the teaching staff under your leadership 
and that facilities for machine-scoring of tests will not be available. 

Work out a comprehensive and detailed plan, telling how you would: 


1. Determine the purposes of the program (List possible purposes.) 

2. Secure the cooperation of the staff 

3. Find out about available instruments 

4. Decide what instruments you would need to meet your pur- 
poses, and select, and order or develop them. 


5. Plan the administration of the tests (Make a tentative schedule
like that on page 345.)

6. Train personnel for the administration, scoring, and analysis 
of results 

7. Get the job done 


8. Use the results to achieve your purposes 


6 Arthur E. Traxler, "Fifteen Criteria of a Testing Program," The Clearing House,
25:3-7 (September, 1950).
Using the Results of Measurement 


In preceding chapters many kinds of measuring instruments have been
described and evaluated. We have suggested or discussed the uses of these
instruments in many instances. In Chapter 13 we emphasized that measurement
programs should be planned and carried out in the light of definite 
purposes. In that chapter also, a number of broad objectives or purposes 
for a measurement program were cited, and it was stated that these would 
be discussed in some detail later. Below are listed the objectives men- 
tioned there. 


Classification of pupils
Homogeneous grouping
Diagnosis and remedial work
Counseling and guidance
Marking
Motivation
Identification and study of exceptional children
Interpreting schools to the community
Improvement of school staff
Educational research


These are probably the main purposes for which tests and other measuring
instruments are used in the schools today. Certainly they seem important
and worthy of careful consideration; it will be the aim of this 
chapter, therefore, to discuss these purposes and to show how measurement 
can contribute to the attainment of each. 


CLASSIFICATION OF PUPILS 


Placing pupils at particular grade levels, retarding them, or accelerating
them are all aspects of the classification problem. Pupils must be placed
at levels where they can learn without being unduly discouraged,
overworked, or bored. Placement at particular grade levels
may be thought of as vertical classification, as contrasted with horizontal 
grouping which will be discussed shortly. The most useful test for such 
classification or grade placement is probably a general school achievement 
battery. These batteries have grade norms for the separate subject tests 
such as English, social studies, mathematics, etc., and there are norms for 
the battery as a whole. Additionally, there are usually norms for con- 
verting the scores into age levels. With these two types of norms it is 
possible to determine fairly accurately a pupil's grade level with regard to 
achievement in school subjects. 

When a pupil transfers from one school to another, particularly when 
he moves to a different school system or state, a standardized achievement 
battery is one of the most dependable tools for determining his level of 
achievement. It is probably more accurate to speak of levels of achieve- 
ment, since these will not always be the same in different subjects or parts 
of the test battery, and consequently the profile, based on scores on the 
different subject tests, will generally not be a straight line. Figures 13, 14, 
and 15 illustrate the types of profiles obtained from scores on achievement 
batteries. 

Figure 13 


A Pupil's Completed Profile Chart, Based on his Scores on the Stanford 
Achievement Test, Advanced Battery 


(Source: Truman L. Kelley et al., Stanford Achievement Test: Directions for
Administering, Intermediate and Advanced Batteries. Yonkers-on-Hudson, N.Y.: World Book
Company, 1953. Reproduced by permission of the publisher.) 
Figure 14 


A Profile Showing Achievement Test Scores for the Same Student in 
Grades 7, 8, and 9 


[Profile chart. Column headings: T score; Read. Voc.; Read. Comp.;
Math. Reas.; Math. Fund.; Lang.; Spell.; Total Score.]

The upper line represents his scores in Grade 9, the middle line 
his scores in Grade 8, and the lower line his scores in Grade 7. 


(Source: E. Gordon Collister and Kenneth E. Anderson, "A Method of Reporting Test 
Results," University of Kansas Bulletin of Education, 8:45-51 [February, 1954]. Reproduced 
by permission of the authors.) 


As these illustrations show, a pupil’s profile may be used to make useful 
comparisons such as: 


a. Comparison of his standings in the different subjects 

b. Comparison of his standings with national or other types of 
norms 

c. Comparison of the pupil’s achievement from year to year 

d. Comparison of his achievement with that of other pupils in the 
same class or grade 
Figure 15 


A Profile Comparing the Achievement Test Scores of Four Seventh- 
Grade Students with the Mean for Students in Grade 7 


[Profile chart. Column headings: T score; Read. Voc.; Read. Comp.;
Math. Reas.; Math. Fund.]


Lines A, B, C, and D represent four different seventh-grade students. 
The top broken line represents one standard deviation above the 
mean, the middle broken line represents the mean, and the bottom 
broken line represents one standard deviation below the mean. 


(Source: E. Gordon Collister and Kenneth E. Anderson, "A Method of Reporting Test 
Results," University of Kansas Bulletin of Education, 8:45-51 [February, 1954]. Reproduced 
by permission of the authors.) 


It is usually desirable in classifying pupils to give a general intelligence 
test in addition to the other tests so that both educational and general 
mental development can be taken into account in placing the pupil. Then, 
knowing his grade level of achievement, his chronological age, his mental 
age, and his I.Q., we can arrive at a decision based on objective data rather 
than guesswork. If the pupil’s previous school record is available, this 
too should be taken into consideration, though for grade placement with 
respect to subject-matter achievement the test battery is probably the 
most accurate measure. 

Thus, for example, if a pupil is 9 years, 6 months of age, with an over-all 
achievement of the seventh month of the fifth grade, and a mental age of 
10 years, 8 months, we can form a fairly accurate estimate of his level of 
development and achievement. His I.Q. is 10-8 (or 128 months) divided 
by 9-6 (or 114 months) × 100, or 112. Assuming school entrance was 
between 5-6 and 6-0 (which can usually be checked), we have the following 
data: 


Age    M.A.    Grade Level    I.Q.    Year in School
9-6    10-8    5-7            112     5th


In this case the picture is fairly clear. The pupil is somewhat above 
average in mental age and intelligence, and is at grade for his age and years 
in school. He could properly be placed in the fifth grade and encouraged 
to do extra work and outside reading to satisfy his accelerated mental 
development. 
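The arithmetic behind the I.Q. in this example can be checked with a short sketch. The helper names `months` and `ratio_iq` are introduced here only for illustration; the ages are the ones given in the text, written in the same years-months notation.

```python
def months(age):
    """Convert an age written as 'years-months' (e.g. '10-8') to months."""
    years, extra = age.split("-")
    return int(years) * 12 + int(extra)


def ratio_iq(mental_age, chronological_age):
    """Ratio I.Q.: mental age divided by chronological age, times 100,
    rounded to the nearest whole number."""
    return round(100 * months(mental_age) / months(chronological_age))


ma_months = months("10-8")     # 128 months
ca_months = months("9-6")      # 114 months
iq = ratio_iq("10-8", "9-6")   # 100 * 128 / 114, rounded
```

The same function reproduces the other case discussed in this section: a mental age of 9 at a chronological age of 12 gives `ratio_iq("9-0", "12-0")`, or 75.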

In cases involving retardation or acceleration the same types of in- 
formation about the individual will be useful. The weight of evidence 
and opinion seems to be against holding pupils back, particularly at upper 
grade levels. Studies show that most pupils repeating a grade do just as 
badly as, or worse than, they did the first time.¹ If this is generally true, 
there seems little to be said in favor of holding back over-age pupils. 
However, tests can be very useful in determining a slow learner's level of 
achievement and his mental level. A pupil aged 12 years with a reading 
level of third grade, a mental age of 9, and an I.Q. of 75 will surely flounder 
hopelessly if he tries to do sixth-grade work. He will probably never go 
beyond the eighth grade and can hardly be expected to do satisfactory 
work as compared with average pupils; yet he can be helped by special in- 
struction in reading, arithmetic, and other subjects, with material at the 
third-grade level in difficulty, but which is suitable in content for a twelve- 
year-old. The necessary preliminary information for the proper adjust- 
ment of a program to individual differences in ability and achievement is 
provided most quickly and efficiently by standardized tests. 

The question of acceleration is related to that of non-promotion. If it 
is undesirable to hold back slow pupils, the same argument applies in the 
case of bright ones. When a very able pupil is kept at grade we are in 
effect not promoting him just as surely as when we make a dull pupil repeat 
a grade. Pressey has made a thorough review of the substantial evidence 


1 Henry J. Otto, “Pupil Failure as an Administrative Device in Elementary Educa- 
tion,” Elementary School Journal, 34:576-89 (April, 1934). 
on this point.² The evidence shows clearly that students who are accelerated
by extra promotions in school, or who are permitted to finish 
college in less than the usual time, seem to do well in their studies and are 
well adjusted socially; they seem not to suffer in health, and seem not to 
be handicapped otherwise as a result of their acceleration. Nevertheless, 
many school authorities are reluctant to adopt such practices. Parents 
also often wish their children to remain with “their group,” and are fearful 
of what acceleration might do to their children's social adjustment. It is 
unfortunate that high school and college students who are both mentally 
and physically advanced for their age are generally required to sit through 
all the regular lessons and classes when they could proceed much faster in 
the environment of a more advanced grade or class level. Not only would 
acceleration save much valuable time, but it would also help to avoid the 
boredom and the bad study habits which may be developed by a bright 
pupil who is required to wait while the slowest in the group catches up. 

When achievement tests and mental tests reveal that a pupil is ac- 
celerated in achievement and mental development, and if he has no physical 
or social handicaps, he should be carefully considered for extra promotions 
and adjustment of work to his level. Moreover, he should be encouraged 
to progress through school and college as rapidly as he can, and at every 
level the tasks presented him should be in keeping with his abilities and 
should require appropriately high standards of work. Able students are 
needed in the world of today, perhaps more than ever before. To hold 
them back on the basis of fears which seem quite unsubstantiated by the 
available evidence is an injustice to them and to society. 


HOMOGENEOUS GROUPING 


Various methods of grouping pupils according to ability are widely 
practiced, particularly in the elementary grades. The grouping may be 
done informally and more or less subjectively, as when a teacher of second 
grade forms her forty pupils into reading groups of ten to fifteen on the 
basis of reading ability. Or it may be done formally, as when a hundred 
pupils in the fifth grade are divided into three classes according to general 
mental ability. 

This type of classification, by which pupils in the same grade are grouped 
according to ability, may be thought of as horizontal, in contrast with the 
vertical grouping already discussed. Ability classification, or homogeneous 


2 S. L. Pressey, Educational Acceleration: Appraisals and Basic Problems, Bureau of
Educational Research Monograph No. 31 (Columbus, Ohio: Ohio State University, 1949).
grouping, is a rather controversial subject and has been for many years. 
In the early days of educational measurement such grouping was looked 
upon with more favor than it appears to be at present, particularly at the 
high school level. Grouping in one form or another is common in the ele- 
mentary schools where pupils read, recite, and do other academic work in 
groups whose members are usually judged by the teacher to be somewhat 
alike in ability. However, in the elementary school pupils are not sectioned 
and separated as high school pupils usually are, and therefore the objection 
that grouping at the elementary level is undemocratic does not apply, or 
at least seems less important. There is ample opportunity in the elementary 
classroom for social intercourse among all pupils in a class, even if they are 
grouped according to ability in reading, arithmetic, and other academic 
work. 

Ability grouping is probably practiced widely, whether or not it is an- 
nounced or so labelled. When there are multiple sections in ninth-grade 
English, for example, there can be little question that many principals 
form them on the basis of ability as measured by I.Q., or previous marks 
in English, or both. When this is done as a matter of course and without 
publicity, it is generally accepted by teachers, pupils, and parents as a 
sensible plan. In fact, the available evidence seems to indicate that where 
ability grouping is practiced the majority of pupils, parents, and teachers 
are happy and satisfied with it.³

Critics of homogeneous grouping have often stated that there is no such 
thing as a homogeneous group. Their argument is that when groups are 
formed on the basis of I.Q., they are still heterogeneous with respect to 
age, or physical development, or some other factor. It may be said on this 
point that we know of no claims that grouping pupils according to I.Q., 
for example, makes the groups homogeneous in every other respect. If 
100 pupils are sectioned on the basis of I.Q. into three classes with I.Q.’s 
ranging from 75-90, 91-110, and 111 upward, they are obviously more 
homogeneous with respect to I.Q. The range of I.Q.’s in each of the groups 
is substantially shorter than it would be if the three sections were formed 
at random. Since the I.Q. is known to be related to learning ability, we 
do assume that in making the groups or sections more homogeneous with 
respect to I.Q., we also make them more alike in ability to learn. If it were 
desired to make them even more homogeneous with respect to learning 
ability, groups could be formed on the basis of several factors such as 
mental age, I.Q., and school marks, simultaneously. 
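
The narrowing of range described above can be checked with a small sketch. The 100 scores used here are made-up values spread evenly from 75 to 135, not data from any study; the band limits are the ones named in the text, and the seed for the random sections is an arbitrary choice.

```python
import random

def section_by_iq(iqs):
    """Place each I.Q. in one of the three bands named in the text."""
    sections = {"75-90": [], "91-110": [], "111 upward": []}
    for iq in iqs:
        if iq <= 90:
            sections["75-90"].append(iq)
        elif iq <= 110:
            sections["91-110"].append(iq)
        else:
            sections["111 upward"].append(iq)
    return sections

def score_range(scores):
    """Range = highest score minus lowest score."""
    return max(scores) - min(scores)

# Hypothetical class: 100 I.Q.'s covering every value from 75 to 135.
iqs = [75 + (37 * i) % 61 for i in range(100)]

ability_sections = section_by_iq(iqs)

# Three sections formed at random (fixed seed so the run is repeatable).
shuffled = iqs[:]
random.Random(0).shuffle(shuffled)
random_sections = [shuffled[0:33], shuffled[33:66], shuffled[66:100]]

print("whole group range:", score_range(iqs))
for name, scores in ability_sections.items():
    print(name, len(scores), "pupils, range", score_range(scores))
for n, scores in enumerate(random_sections, start=1):
    print("random section", n, "range", score_range(scores))
```

With these assumed scores no ability section can span more than its band limits allow, whereas a random section may span nearly the whole 60-point spread of the class.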

3 The Grouping of Pupils, Thirty-Fifth Yearbook of the National Society for the 
Study of Education, Part I (Bloomington, Ill.: The Public School Publishing Company, 
1936), pp. 302-03. 

Moreover, where groups are formed on the basis of the I.Q. alone, some 
problems are likely to develop. Consider, for example, two fifth-grade 
pupils. 


Pupil A: I.Q. = 133 
Pupil B: I.Q. = 133 


If these two and others like them were grouped together on the basis of 
I.Q. alone, the results might be quite unsatisfactory, since A is almost two 
and one-half years more mature mentally and nearly two years older than 
B. If both I.Q. and M.A. were used as the basis of classification, these 
two pupils would not be in the same group. A would be in a bright group 
of more mature pupils, and B would be in an equally bright but younger 
group. Both would be more at home and would be more likely to achieve 
close to the maximum of their respective abilities under these conditions 
than if they were in the same group. 
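
The arithmetic behind such a pair can be sketched as follows, taking the I.Q. as 100 times the ratio of mental age to chronological age. The ages below are hypothetical, chosen only to fit the verbal description (same I.Q., A nearly two years older and almost two and one-half years more mature mentally); they are not the figures of the original example.

```python
def months(years, extra_months):
    """Convert an age written as years-months into months."""
    return 12 * years + extra_months

def iq(mental_age_months, chronological_age_months):
    """I.Q. as the ratio of mental age to chronological age, times 100."""
    return round(100 * mental_age_months / chronological_age_months)

# Hypothetical pupils (assumed ages, not taken from the text).
pupil_a = {"ca": months(11, 6), "ma": months(15, 3)}
pupil_b = {"ca": months(9, 8), "ma": months(12, 10)}

for name, p in (("A", pupil_a), ("B", pupil_b)):
    print("Pupil", name, "I.Q. =", iq(p["ma"], p["ca"]))

print("M.A. difference in months:", pupil_a["ma"] - pupil_b["ma"])
print("C.A. difference in months:", pupil_a["ca"] - pupil_b["ca"])
```

Grouping on I.Q. alone would treat these two pupils as alike; adding M.A. as a second basis separates them.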

In such plans for the formation of more homogeneous groups one takes 
into account not only intelligence as expressed by the I.Q., but also mental 
maturity. In this way groups are formed that are more alike with respect 
to the intelligence quotient and with respect to the level of complexity and 
difficulty of tasks with which they can deal successfully and profitably. 

Whatever the basis used for grouping pupils, two factors have a most 
important bearing on the success of the grouping. First, it must be recog- 
nized that what teachers do in adapting content and method to different 
ability groups largely determines how effective the grouping will be in 
terms of increased learning and attainment of goals. To form different 
groups according to ability without making modifications in the methods 
and materials used with those groups is not likely to result in any advantage. 
The evidence on grouping suggests that “Experimental studies have 
in general been too piecemeal to afford true evaluation of results, but when 
attitudes, methods, and curricula are well adapted to further the adjustment 
of the school to the child, results both objective and subjective seem 
to be favorable to grouping.”⁴ 

Second, in any scheme of grouping there should be provision for the 
shifting and adjustment of individual pupils. No pupil should feel that 
he is finally and permanently attached to a particular group. The able 
pupil must demonstrate his ability to stay in a faster group, while the slow 
pupil should always be made to feel that he can change his status by doing 
better work. Such flexibility, though perhaps difficult from the administrative 
standpoint, would go far toward meeting one of the most frequently 
voiced objections to ability grouping, namely, that it is not democratic. 
Moreover, it would provide excellent motivation for pupils at all levels and 
in every group. 


4 Ibid., p. 304. 


Learning Exercises 


1. Distinguish between vertical classification and horizontal classification. What 
are the common bases or criteria for each? 
2. Summarize the arguments and the evidence, pro and con, for each of the 
following: 
a. Classification by grade versus non-graded classification 
b. Homogeneous versus non-homogeneous (or heterogeneous) grouping 
3. If you were to form three ability sections in ninth-grade general science, how 
would you proceed? Assume that 100 pupils have elected the course. 


DIAGNOSIS AND REMEDIAL WORK 


In contrast to the two areas of usefulness just discussed, the employment 
of tests for diagnosis is an instructional function rather than an adminis- 
trative one. The purpose of a diagnostic test is to find the specific weak- 
nesses and strengths of a pupil in a particular area of study or subject 
matter. In the survey of measurement practices and preferences of high 
school teachers by Noll and Durost,⁵ it was found that diagnostic testing 
and remedial work are the most frequently mentioned uses of standardized 
test results. Between 40 and 50 per cent of the teachers using standardized 
tests reported these as their purposes in giving such tests. 

The process of diagnosis in education may be thought of as a progression 
from broad, general areas to narrower and more specific knowledges or 
skills. For example, one might begin by giving to a class of seventh-grade 
pupils a survey battery including tests of language arts, social studies, 
mathematics, and science. After these tests are scored and the results 
analyzed, it might appear that the class as a whole is up to or above the 
acceptable standards in all areas tested except language arts. Further 
testing in the language arts might show that vocabulary, reading, and 
spelling are acceptable, but that there are serious weaknesses in funda- 
mentals of grammar, sentence structure, punctuation, and capitalization. 
Knowing of these weaknesses, one can then proceed to give diagnostic tests 
in these areas of English composition to determine what rules of grammar, 
sentence structure, and the other phases have been inadequately mastered 


5 Victor H. Noll and Walter N. Durost, Measurement Practices and Preferences of High 
School Teachers, Test Service Notebook No. 8 (Yonkers-on-Hudson, N.Y.: World Book 
Company). 




and are in need of further study and drill. The real diagnosis is done only 
at this last level of measurement, although all which precedes it is basic to 
the last step. However, a teacher may begin at this point without going 
through the earlier steps. That is, he may give his pupils a diagnostic test 
at any time to discover strengths and weaknesses in pupil learning so that 
he may review and improve points not mastered through previous teaching 
and study. 

A truly diagnostic test is usually planned and constructed with this 
function in mind. Many achievement tests may be used for diagnosis, but 
much time and energy are saved and a more systematic analysis is possible 
when a test is designed and built with diagnosis in mind, if that is to be its 
function. There are several important steps to be followed in the diagnostic 
testing procedure, and these are listed below. 

First, there should be a careful analysis of the rules, principles, knowledges, 
or skills which the test is intended to measure. In the example cited above, 
this would mean analysis of the rules or principles of good usage with re- 
spect to grammar, sentence structure, punctuation, and capitalization in 
English. The Pressey Diagnostic Tests in English Composition⁶ are good 
examples of tests based on this kind of analysis. The test in each of the 
four areas covers the basic rules in that area. For example, the punctuation 
test covers such rules as, “Every declarative sentence should be followed 
by a period.” The capitalization test covers such rules as, “Begin 
every proper name with a capital letter,” and similarly appropriate rules 
are covered in the tests of grammar and of sentence structure. 

Second, a good diagnostic test is planned and constructed so that every rule 
or principle is adequately and equally tested by objective items. For example, 
in the Pressey test on capitalization each of seven basic rules is covered by 
four objective test items. By this method, no point of importance is 
slighted or over-emphasized, and the user of the test can be sure of reason- 
ably adequate and systematic coverage. 

Third, the test items are generally arranged in groups to facilitate the analysis 
and diagnosis. That is, if there are four items on each rule, those dealing 
with the same rule will be placed together rather than scattered through- 
out the entire test. This makes it simpler, in analyzing the results, to de- 
termine specific areas of strength and weakness. 

In addition to the above-mentioned principles, diagnostic tests are usu- 
ally accompanied by a chart similar to the one reproduced in Figure 16. 
This permits the diagnosis of strengths and weaknesses of the class or 
group as a whole, as well as of individuals. 

6 S. L. Pressey, Diagnostic Tests in English Composition (Bloomington, Ill.: Public 
School Publishing Company, 1924). 


Figure 16 


Analysis Chart Showing the Results of a Seventh-Grade Class on the Pressey 
Diagnostic Test in Capitalization 

(Reproduced by permission of the Public School Publishing Company, publisher.) 


CAPITALIZATION TEST 
RULES COVERED BY TEST 

1. Capitalize the first word of every sentence. Capitalize also the first word 
of every line of poetry, and the first word of a direct quotation. However, 
if the quotation is indirect do not use the capital. 

2. [Capitalize the names of persons] with their titles; however, [do not] 
capitalize titles when they are not part of a name. 

3. Capitalize the names of countries, states, cities, streets, buildings, 
of mountains, rivers, oceans, or any word designating a particular location 
or part of the world; however, do not capitalize the points of the compass, 
or such terms as street, river, ocean, when not part of a name. 

4. Capitalize the names of business firms, schools, societies, or other 
organizations; however, do not capitalize such words as company, school, 
society, when not part of a name. 

5. Capitalize words derived from the names of countries, places, 
organizations or persons. 

6. Capitalize the days of the week, the months of the year, and holidays; 
however, do not capitalize the seasons. 

7. Capitalize the first word, and all other important words, in titles (and 
sub-titles and headings) of themes, magazine articles, poems, books, of 
laws or governmental documents, and the trade names of commercial products. 


The diagnostic chart for Miss Jones’s seventh-grade class reveals the 
standing of individuals and of the class. Reading across the page, it is 
possible to determine individual needs in capitalization. Jane Allen, for 
example, misses only 5 items out of 28, and these deal with Rules 4, 5, 6, 
and 1, respectively. On the other hand, Mary Brady appears to be very 
weak on capitalization, for she has answered correctly only 8 out of 28 and 
has missed most of the items relating to every rule except No. 2. 

Reading the chart vertically, by groups of items relating to each rule, 
we see that the class as a whole is strong on Rules 2, 7, and 1, and is com- 
paratively weak on Rules 3, 4, 5, and 6 — especially the latter two. This 
gives Miss Jones information which will be useful in planning remedial 
work and drill for the whole class. 
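
The two ways of reading such a chart, across for the individual and down for the class, can be sketched in a few lines. The three pupils and their answers below are hypothetical and do not reproduce Figure 16; the cut-offs used to call a rule "weak" (half or fewer of a pupil's four items correct, under 60 per cent of the class's items correct) are likewise assumptions made only for illustration.

```python
RULES = 7            # rules covered by the capitalization test
ITEMS_PER_RULE = 4   # objective items per rule, placed together

def correct_by_rule(responses):
    """Count correct answers per rule from 28 True/False item results."""
    return [
        sum(responses[r * ITEMS_PER_RULE:(r + 1) * ITEMS_PER_RULE])
        for r in range(RULES)
    ]

def responses_from_counts(counts):
    """Build a 28-item record from the number correct on each rule."""
    record = []
    for c in counts:
        record += [True] * c + [False] * (ITEMS_PER_RULE - c)
    return record

# Hypothetical pupils (not the chart in Figure 16).
chart = {
    "Pupil 1": responses_from_counts([3, 4, 4, 3, 3, 2, 4]),
    "Pupil 2": responses_from_counts([1, 4, 1, 0, 1, 0, 1]),
    "Pupil 3": responses_from_counts([4, 4, 2, 2, 1, 1, 4]),
}

# Reading across: each pupil's total and the rules on which he is weak.
for name, responses in chart.items():
    per_rule = correct_by_rule(responses)
    weak = [r + 1 for r, c in enumerate(per_rule) if c <= ITEMS_PER_RULE // 2]
    print(name, "score", sum(per_rule), "weak on rules", weak)

# Reading down: class total correct on each rule.
class_totals = [
    sum(correct_by_rule(resp)[r] for resp in chart.values())
    for r in range(RULES)
]
possible = len(chart) * ITEMS_PER_RULE
class_weak = [r + 1 for r, t in enumerate(class_totals) if t / possible < 0.6]
print("class totals by rule:", class_totals, "weak rules:", class_weak)
```

Because the items on each rule are placed together, a simple slice of the answer record gives the tally for that rule, which is the practical advantage of the grouping described earlier.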

At the side of the chart the individual scores are tabulated and the 
median is determined. By comparison with seventh-grade norms it is evi- 
dent that this class is quite deficient in knowledge and understanding of 
the rules of capitalization and will need thorough review and practice on 
fundamentals. Sometime later, another form of the test may be given to 
measure improvement. Since there are four forms of the test, the teacher 
may test and teach, test and teach, several times without giving the same 
test twice. 

Equivalent forms are very advantageous in diagnostic tests, perhaps 
even more desirable than with standardized tests in general. The most 
widely accepted method for the use of diagnostic tests calls for testing, 
remedial instruction, retesting, further remedial instruction, etc., until 
adequate mastery is attained. In this process it is highly desirable to have 
available several equivalent forms of each test so that the second and sub- 
sequent testings may be done with different forms covering the same ob- 
jectives. This avoids the repeated use of exactly the same questions, and 
thus greatly reduces or practically eliminates gain in scores due to familiar- 
ity with the questions. 

It should be emphasized that a diagnostic test does not necessarily re- 
veal the causes of weaknesses. A diagnostic test in arithmetic may reveal 
certain specific deficiencies, let us say, in the multiplication of two-place 
numbers by two-place numbers, as 87 × 34. These weaknesses may be 
the result of a lack of correct knowledge of multiplication tables, or of 
carrying, or of adding, or of some other process. But knowing that a pupil 
does not perform one or the other of these operations correctly is no guarantee 
that remedial work and drill on the specific processes will produce 
the desired improvement. Deficiencies in learning may be due to various 
causes such as defective hearing or vision, poor home conditions, unsatisfactory 
relations with classmates or teacher, lack of ability, and so on. In 
all diagnostic and remedial work it is essential to identify the basic causes 
of deficiencies, and to work on those. Otherwise, remediation is likely to 
be a waste of time and energy. 


4. How does a diagnostic test differ from a conventional achievement test? 
Answer by comparing a standardized achievement test, perhaps in fundamentals 
of English, with one like the Pressey. 

5. If you were selecting a test for diagnostic purposes in fundamentals of multi- 
plication in arithmetic, what would you look for? 

6. Why are there more diagnostic tests available for elementary school than for 
high school or college? 


COUNSELING AND GUIDANCE 


Broadly stated, the function of the counselor or guidance worker is to 
help pupils achieve satisfactory and satisfying solutions to their problems. 
Darley,⁷ as a result of a survey of high school seniors in a large city, has 
classified these problems in a descending order of frequency, as follows: 
(1) vocational, (2) educational, (3) social or personal, (4) financial, (5) fam- 
ily adjustment, and (6) health. The emphasis might be different at lower 
grade levels or in different communities, but the categories seem to en- 
compass the major areas of concern to adolescents, as well as to people in 
general. 

The most common problems under these respective categories were: 
(1) discrepancy between vocational goal and abilities, (2) discrepancy 
between educational goal and abilities, (3) feelings of inferiority, (4) too 
much outside work, and inadequate finances, (5) family conflicts over 
educational and vocational plans, desire for independence, personality and 
age differences, and (6) poor health. Under educational problems (2) 
should probably be added those related to the choice of subjects (such 
problems, for example, as whether to take algebra or homemaking), and 
those related to the choice of a curriculum or course of study. 

To be successful, any counseling program should include the systematic 
and intelligent use of tests and other measuring instruments. It is difficult 
to imagine how a guidance program could function without the use of 
measures of intelligence, achievement, aptitudes, interest and personality 
ratings, observation, interviews, etc. Where discrepancies exist between 
goals and abilities, intelligence test results tactfully and confidentially dis- 
cussed with the pupil may help to bring about a readjustment of plans more 
closely in line with his talents. Marks and test scores representing the 
student’s achievement in school work will often bring a student to face 
realities and help him decide that perhaps he does not want to be an 
electrical engineer after all, in view of the advanced mathematics and 
physics required. Together with the results of measures of general academic 
ability, such data, when discussed sympathetically with the pupil, 
and perhaps with his parents, have great value in bringing about an acceptance 
of educational and vocational goals that are more consistent with 
abilities. Such counseling procedures often avoid frustration and unhappiness 
for students and parents. 


7 John G. Darley, Testing and Counseling in the High School Guidance Program 
(Chicago: Science Research Associates, 1943), pp. 140-41. 

When a pupil seeks guidance in educational or vocational matters, tests 
of aptitudes and interests provide useful tools. The nature of such in- 
struments has already been discussed in Chapters 10 and 11, and the ap- 
plications of these instruments in guidance and counseling are obvious, 
particularly with reference to educational and vocational decisions. 

An aptitude test is a useful tool in counseling if the results are properly 
understood and used. A high score on a test of mechanical aptitude does 
not guarantee success in an occupation requiring mechanical ability any 
more than a good score on a test of musical aptitude guarantees that the 
person who made the score will become a great musician. Many other 
factors besides aptitude enter into success; interest, effort, persistence, and 
opportunity all contribute. Yet the counselor can encourage the person 
making a high score on an aptitude test to the extent of saying that he 
seems to have the talent if he will develop it, and can help him to determine 
whether or not he has the other qualities needed for success in the field in 
question. 

On the other hand, the counselor can speak with more assurance in the 
case of a low score on the test. It is safe and pertinent, in such instances, 
to point out to the individual that statistics show that perhaps only one 
person in ten with his score is successful in this particular vocation. The 
choice of a life work should certainly not be based on the results of a single 
aptitude test, yet the information this provides may be the deciding factor 
when all other data concerning the particular occupation and person do 
not seem to point to a clear-cut decision. 

The interest inventory provides a useful supplement to other kinds of 
tests. It should be emphasized again, perhaps even more strongly than in 
the case of aptitude tests, that a particular pattern of interests or prefer- 
ences is not in any sense a guarantee of success in a given field. All one 
can say is that the interest score or pattern of the individual tested resembles 
or does not resemble that of successful persons in a particular occupation 
or field of work. 

Interest tests are widely used, especially at the senior high school and 
junior college level. The results of such tests, when carefully studied and 
discussed with the counselor, have considerable usefulness, particularly 
in giving the student more insight into his own potentialities and causing 
him to think more carefully about his own decisions and choices. Used in 
conjunction with the results of other tests, and information about home 
background, financial status, scholastic record, and health, they help to 
round out the picture of the individual. 

Student problems frequently occur also in the areas of adjustment and 
personality. The survey reported by Darley⁸ revealed that feelings of inferiority, 
lack of confidence, and personality clashes in the home were 
among the most common problems of high school seniors. The use of per- 
sonality measures will sometimes reveal the existence of such problems 
when interviews and other means have failed to bring them to light. 

Personality measures of various types may be used with groups to iden- 
tify more or less serious cases of maladjustment. In every hundred pupils 
there will be, on the average, from two to five who need help and who 
should be referred to a clinic or to someone trained to deal with such cases. 
Tests of this nature can also be used by counselors to shed additional light 
on the cases of pupils who come for advice and help, either on personal 
problems or on educational or vocational problems. With the pupil who is 
having difficulties in school or at home, a personality test often gives a 
helpful insight into the situation. The test may be supplemented by in- 
terviews, anecdotal records, ratings, and other information, all of which will 
usually provide a basis for understanding and help. 

Personality tests may be used to help a pupil make the most of his own 
talents, and they will usually help him get along better with others and 
attain a happier life by giving him and the counselor a better insight into 
his emotional and social behavior. In short, these tests help to give the 
individual a better understanding of himself, and will help explain why he 
behaves as he does in certain situations. 

Concerning financial problems and poor health, the counselor will obtain 
information about problems of this nature through means other than tests. 
These difficulties, particularly the health problems, usually require expert 
advice and treatment, and the responsibility of the counselor is mainly to 
identify such cases and refer them to the proper authorities. There may 
be many things a counselor can do to help a pupil earn money in after- 
school hours, or to help the one that is working too many hours outside, 
but these are not matters in which tests will generally play a significant 
role. Yet a knowledge of a student's I.Q. may be useful in helping the 
counselor and the student decide how much work he can take on to earn 
money without doing an injustice to his school work. 


8 Ibid. 


7. What are the major types of problems of adolescents that a counselor is 
likely to encounter? 
8. As a high school counselor, what types of measuring instruments would you 
expect to find most useful? Give reasons for your choices. 
9. Is there a place for counseling and guidance in the elementary grades? If 
so, what measuring instruments should prove useful? 
10. What uses would a counselor have for achievement test results? 


MARKING 


It was stated early in this book that measurement and evaluation are a 
part of the job of teaching. Every teacher has the responsibility for mak- 
ing the best judgments he can about his pupils’ achievement and develop- 
ment in subject matter, maturity, citizenship, character, and in other areas. 
These judgments may be expressed in various ways, but marks are 
the commonest. As our schools and educational programs are constituted, 
marks are an integral part of the system. Pupils, parents, and administrators 
expect them. They are the terms in which a pupil's accomplishments 
are evaluated. It is therefore only sensible for the teacher to try 
to do the best possible job of evaluating and marking, to strive constantly 
to improve the marking system, and to do his best to keep abreast of im- 
provements in marking practices. 

That measurement has an important function in marking is self-evident. 
Teachers regularly use tests and examinations of their own devising as a 
basis for marking, especially when measuring achievement in subject matter. 
However, tests of capacity such as the intelligence test are also useful 
in that they provide a basis for judging whether or not a pupil 
is working up to capacity. In some situations two marks are given, one 
expressing actual achievement in terms of A, B, C, D, or E, and the other 
in terms of S or U, expressing whether the pupil is doing satisfactory or 
unsatisfactory work in relation to his ability. Thus, one pupil of high in- 
telligence might get a B and a U in algebra, while another less able pupil 
might receive a C and an S. In at least one community where this has been 
tried, the system has worked out to the satisfaction of all concerned.⁹ 

A perennial question in arriving at marks has to do with the propriety 
of using standardized tests of achievement for this purpose. It may be 
said at once that the use of such tests as the sole basis for marks is seldom 
justified, since no standardized test is likely to provide adequate measures 
of all the outcomes of a course in a particular school or community. Few 
of all the outcomes of a course in a particular school or community. Few 
teachers would be satisfied to base the evaluation of their pupils’ accom- 
plishments solely or even substantially on standardized test scores. Never- 
theless, it is true that such tests may be useful in helping to arrive at a 
semester’s or a year’s mark in a given course. That is, when a standardized 
test is judged by the teacher to be an adequate measure of one or more of 
the locally determined goals of instruction, there would seem to be no valid 
reason for not using it, together with other measures, for marking pur- 
poses. A standardized test can well supplement or contribute in a useful 
way to evaluation based on the teacher's own measurement procedures. 
A goal of every teacher should be the best possible evaluation of each 
pupil's accomplishments. It is only good sense to make use of every 
practical device that helps achieve this objective. It may even be that a 
carefully constructed standardized test will at times provide a better meas- 
ure of certain outcomes of instruction than the average teacher can make 
for himself, as, for example, in the case of skills tests in elementary grades. 

We have already emphasized that the term measurement, as it relates 
here to marking purposes, includes a wide variety of instruments and 
techniques. In the discussion just preceding, we have described the use 
of achievement and intelligence tests for these purposes. In evaluating 
other kinds of accomplishment, however, different types of measurement 
will also be found useful. The school assumes some responsibility for the 
all-around development of each pupil to the extent of his capacity and desire 
to achieve. This includes not only achievement in subject-matter, but also 
development of character and personality, physical development and 
health, maturity in selecting and planning for his future vocation, choosing 
a life partner, and so on. In any of these areas the teacher may be called 
upon to express some judgment or evaluation of the pupil's status and 
development. In every one measurement has something to contribute. 
Personality measures, interest inventories, measures of physical develop- 
ment, and records of health and physical examinations are all useful in 
making evaluations, whether they are expressed in the form of a mark or in 
some other manner. Measurement can and should enter into every aspect 
of the process of evaluating pupil accomplishment and growth. 

The problem of what proportion of various marks to give — how many 
A's, B's, C's, etc. — is one that is often vexing to teachers. The conscien- 
tious instructor tries to do justice to all his pupils and at the same time 
attempts to conform with good principles of marking. There is no simple 
method to recommend that will satisfy both purposes. Recommendations 
on this point generally are based on the concept of the normal curve. In 
any group or class, unless it be a very small one numbering less than twenty, 
abilities and achievement are likely to be distributed in a fashion approximating 
the normal distribution. If this assumption is appropriate, then 
the distribution of marks should approximate the proportions of the normal 
curve, which means that the largest proportion should be average, which 
is usually C. Smaller proportions will be somewhat above and below the 
average, and these would be marked B and D, respectively. Approximately 
equal and quite small percentages would be found at the upper and lower 
extremes, and these would receive marks of A and E or F, as the case might 
be. 

These principles can be embodied in a number of different systems or 
proportions, but one of the most widely used is based on the standard 
deviation, which is illustrated in Figure 17. 


Figure 17 


Distribution of Marks Based on Standard Deviation 


−2.5σ   −1.5σ   −.5σ   M   +.5σ   +1.5σ   +2.5σ 

It can be seen that the middle group, or C, extends one half standard 
deviation on either side of the mean, which area under the normal curve 
includes approximately 38 per cent of the total. One additional standard 
deviation beyond these limits on either side will include another 24 per 
cent in each case; and another standard deviation beyond +1.5σ and 
−1.5σ, extending to +2.5σ and −2.5σ, will include approximately 7 per 
cent more in each case. The total is 38 + 24 + 24 + 7 + 7 = 100 per cent.¹⁰ 
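
These proportions can be checked against the standard normal curve. The sketch below uses `NormalDist` from Python's standard library; the variable names are illustrative only. It suggests that the strip from 1.5σ to 2.5σ by itself holds nearer 6 per cent, and that the rounded figure of 7 per cent corresponds to counting every case beyond 1.5σ, which may be how the total of 100 is reached.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve: mean 0, standard deviation 1

def pct_between(lo, hi):
    """Per cent of the area under the normal curve between two sigma values."""
    return 100 * (z.cdf(hi) - z.cdf(lo))

c_share = pct_between(-0.5, 0.5)      # middle group, C
b_share = pct_between(0.5, 1.5)       # B (D is the mirror image)
a_share = pct_between(1.5, 2.5)       # A strip only (E is the mirror image)
beyond_1_5 = 100 * (1 - z.cdf(1.5))   # every case above +1.5 sigma

print("C:", round(c_share, 1))
print("B or D:", round(b_share, 1))
print("A or E, 1.5 to 2.5 sigma:", round(a_share, 1))
print("A or E, all cases beyond 1.5 sigma:", round(beyond_1_5, 1))
```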

Although measurements of most classes will not be distributed in exactly 
these theoretical proportions, they will quite generally approximate them 
if (a) the classes are not highly selected, and (b) the measures used are 
adequate for all levels of ability represented in the group. No teacher will 
want to force his distribution of marks into these theoretical proportions, 
particularly if he has reason to believe that the nature of the group or the 
measuring instruments used do not justify it. Nevertheless, the concept 
of a normal distribution as applied here can be a very useful guide to a 
teacher in making out marks and can help him to avoid giving marks that 
are clearly out of line with sound principles. 


10 [...] infinity, so that within 2[...] total area under the curve. For all 
practical purposes, we may assume that 100 per cent of the cases or scores 
will fall between these limits. 
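
As a guide of this kind, the standard-deviation limits of Figure 17 can be turned into a simple rule for a set of scores. The sketch below is one possible reading, not a prescribed procedure: the scores are hypothetical, E is assumed to be the lowest mark, and a score falling exactly on a limit is assumed to take the higher mark.

```python
from statistics import mean, pstdev

def letter_mark(score, class_mean, class_sd):
    """Mark a score by its distance from the mean in standard deviations."""
    z = (score - class_mean) / class_sd
    if z >= 1.5:
        return "A"
    if z >= 0.5:
        return "B"
    if z >= -0.5:
        return "C"
    if z >= -1.5:
        return "D"
    return "E"

# Hypothetical test scores for a class of fourteen pupils.
scores = [52, 58, 61, 64, 66, 68, 70, 70, 72, 74, 76, 79, 82, 88]
m, sd = mean(scores), pstdev(scores)

marks = [letter_mark(s, m, sd) for s in scores]
for s, mark in zip(scores, marks):
    print(s, mark)
```

With so small a class the resulting proportions will depart from the theoretical ones, which is one reason the teacher's judgment must remain the final check.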

What has been said so far and the facts that are known about the dis- 
tribution of human abilities and achievements provide a basis for some 
useful generalizations about marking and the use of measurement in mark- 
ing. A brief statement of these principles is given below. In reading and 
thinking about them the student is cautioned to remember that no hard 
and fast rules can be laid down for every teacher to follow in giving marks. 
Each teacher will have to use his own best judgment in his own situation, 
since marking, generally speaking, is a responsibility that cannot be shared 
with anyone else. The one unequivocal principle, if there be such, is that 
justice be done to every pupil insofar as possible, and that none be favored 
above any other. 

1. It is generally agreed that marks should be assigned on a comparative 
basis. That is, the best pupils should receive the highest marks, the next 
best, B’s, etc. In most school situations this practice is certainly preferable 
to setting some arbitrary and perhaps unrealistic standard and failing those 
pupils who do not attain it. 

2. Practice now generally favors the use of letter marks rather than 
percentages. Letter marks have several advantages: they are easier to use, 
easier to interpret, and are more realistic. In the first place, it is easier to 
mark a group with a five-point scale than with one having a hundred di- 
visions. Letter marks are easier to understand for the same reason. It 
may be said with confidence that not many people can make the fine dis- 
tinctions of judgment that percentage marks imply. It has been demon- 
strated that teachers can discriminate among five or six levels of quality 
or achievement, but not one hundred! 

3. Marks should be based as much as possible on objective measure- 
ments. Enough has been said in earlier discussions to show the unrelia- 
bility of teachers’ judgments of essay examinations. It may be assumed 
that what was demonstrated in those experiments applies with equal force 
to other subjective judgments or processes. It may not be possible to find 
or construct objective measures of all desirable outcomes, but the aim 
should be to move constantly in this direction and to increase the objectivity 
of our measurements as much as possible. In fairness to the pupil and for 
the teacher's own peace of mind in evaluating his pupils’ accomplishments, 
objectivity is a goal that should be worked for constantly. 

4. As far as possible, marks should express accomplishment of specific 
goals rather than the results of global or omnibus appraisal, for a marking 
system which does not provide for some precise differentiation as, for exam- 
ple, in the case of the dual system of marking described above, is of little 
value. Also, one should not attempt to combine in one mark achievement 
in subject matter and such other traits as courtesy, punctuality, or effort. 
Of course the measurement of these traits is important, but if the pupil is 
given a B in arithmetic the mark should denote his accomplishment in that 
subject. If desired, he may be given credit for courtesy, effort, and other 
important matters, but this should be expressed in separate marks. Other- 
wise, two pupils may receive B's in arithmetic, one of the marks representing 
achievement of A and effort of C, the other representing an achievement 
of C and effort of A. Obviously, to give them the same mark of B is an 
injustice to both pupils as well as to their next teacher, their parents, and 
any others who have no way of knowing what the marks really represent. 

This principle applies also to the appraisal of growth or improvement. 
Such outcomes should not be combined in one mark with status or level of 
accomplishment. To illustrate, let us assume that marks are being given 
to two girls taking typewriting. One, A, starts at 10 words per minute and 
progresses by the end of the year to 40 words per minute. Another girl, B, 
starts at 30 words per minute and goes to 50 words per minute. A made a 
gain of 30 words while B gained but 20. Yet B is a more efficient typist 
than A. Would it be fair to give A a better mark than B because she had 
made a numerically greater gain? It would not seem so. Certainly anyone 
seeking to employ a girl for typing would be misled if both received even 
the same mark. Here, as in cases already cited, it would be most accurate 
and fair to use two marks, one for status and another for improvement. 


Then the picture might be as follows: 


PUPIL     STATUS     IMPROVEMENT 
A         C          B 
B         B          C 


Knowing these facts, the prospective employer or any other person 
concerned would be able to make an intelligent choice, assuming, of course, 
that other aspects of the two cases are equal. 
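
The distinction between status and improvement can also be shown by a small computation. What follows is only an illustrative sketch in Python and is not part of any marking procedure described in this book. The class name TypingRecord and its fields are hypothetical; only the four typing speeds are taken from the example above.

```python
from dataclasses import dataclass


@dataclass
class TypingRecord:
    """Hypothetical record of one pupil's typing speeds (words per minute)."""
    pupil: str
    start_wpm: int  # speed at the beginning of the year
    end_wpm: int    # speed at the end of the year

    @property
    def status(self) -> int:
        # Status: the level of accomplishment finally attained.
        return self.end_wpm

    @property
    def improvement(self) -> int:
        # Improvement: the gain made during the year.
        return self.end_wpm - self.start_wpm


# Speeds taken from the typewriting example in the text.
records = [TypingRecord("A", 10, 40), TypingRecord("B", 30, 50)]

for r in records:
    print(f"Pupil {r.pupil}: status {r.status} wpm, improvement {r.improvement} wpm")

# The two orderings disagree, which is why one combined mark would mislead.
best_status = max(records, key=lambda r: r.status).pupil
best_gain = max(records, key=lambda r: r.improvement).pupil
print("Higher status:", best_status, "| larger gain:", best_gain)
```

Keeping the two quantities in separate fields mirrors the recommendation to report two marks: neither figure can be recovered from a single combined value.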

In substance, this principle states that marks should mean and stand 
for what they are intended to mean and stand for. If a mark is given in 
English it should represent accomplishment in English and, if possible, 
should be differentiated to designate accomplishment in composition, 
ure, English literature, or some other specific course or 

If evaluations of other important qualities such as effort, 
ts and the like are desired, these should be re- 
ctive measurements should contribute 

5. Finally, better marking can be attained by using a wide variety of 
measures. The more measures of a pupil’s achievement that are employed 
and the more varied in approach and design they are, the better the 
sampling of his behavior is likely to be. Even if the tests are quite similar 
in nature the combined results of several of them should give a more ac- 
curate appraisal than any one of them alone. This is simply an application 
of the principle of sampling: the more samples taken the more accurate the 
measurement will be, always providing, of course, that there is no consistent 
bias operating. Furthermore, by use of a variety of measuring instruments 
we are likely to obtain a wider sampling than we otherwise would. In 
measuring the results of instruction in civics, for example, a teacher may 
find it desirable to use not only tests, but also rating scales, anecdotal 
records, and systematic observations of behavior — all of these contrib- 
uting to the appraisal of achievement in civics. 
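
The sampling principle just stated can be put a little more exactly. The following is a sketch under simplifying assumptions that the text itself does not make: each of n measures X_i is taken to equal the pupil's true achievement T plus an independent error e_i with mean zero and the same standard deviation sigma_e. The symbols T, e_i, sigma_e, and b are introduced here for illustration only.

```latex
X_i = T + e_i, \qquad
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\operatorname{SE}(\bar{X}) = \frac{\sigma_e}{\sqrt{n}}
% With a consistent bias b in every measure, X_i = T + b + e_i, so that
% E[\bar{X}] = T + b for every n: more measures reduce chance error only.
```

Under these assumptions four measures halve the chance error of a single measure, while a consistent bias is left untouched, which is the reason for the proviso about bias in the text.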

11. State your viewpoint on the place of marking in the schools. How would 
measurements enter into it? 

12. Name some situations in which selection of persons completing training on 
the basis of improvement alone, without reference to level of skill attained, would 
be dangerous, e.g., airplane pilots. Name some in which amount and nature of 
improvement would be more important than level of proficiency. 

13. The following forty-nine scores on a general science test are taken from 
Appendix A: 71, 68, 67, 61, 60, 58, 58, 55, 54, 52, 50, 49, 47, 47, 46, 45, 44, 44, 44, 
43, 43, 43, 42, 41, 41, 40, 40, 39, 38, 38, 38, 37, 36, 36, 36, 33, 33, 32, 28, 27, 25, 24, 
22, 21, 18, 15, 13, 8, 3. Their mean is 40 (39.8) and their standard deviation is 
15 (15.1). Using these values, assign a mark to each score by the method shown in 
this chapter. What proportion of A, B, C, D and E does this yield? How closely 
does this conform to the theoretical proportions? Explain the reasons for any di- 
vergencies. 
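
The mean and standard deviation quoted in Exercise 13 may be checked directly from the scores. The sketch below assumes Python and its standard statistics module. It reports both the population and the sample form of the standard deviation, since the exercise does not say which was used, and it deliberately stops short of assigning marks, which is left to the student.

```python
from statistics import mean, pstdev, stdev

# The forty-nine general science scores listed in Exercise 13.
scores = [
    71, 68, 67, 61, 60, 58, 58, 55, 54, 52, 50, 49, 47, 47, 46, 45, 44, 44, 44,
    43, 43, 43, 42, 41, 41, 40, 40, 39, 38, 38, 38, 37, 36, 36, 36, 33, 33, 32,
    28, 27, 25, 24, 22, 21, 18, 15, 13, 8, 3,
]

n = len(scores)
m = mean(scores)
sd_population = pstdev(scores)  # divides by n
sd_sample = stdev(scores)       # divides by n - 1

print("number of scores:", n)
print("mean:", round(m, 1))
print("standard deviation (population form):", round(sd_population, 1))
print("standard deviation (sample form):", round(sd_sample, 1))
```

Small differences from the figures in parentheses are to be expected if those were computed from grouped data; the rounded values of 40 and 15 are what the exercise asks the student to use.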


MOTIVATION 


It is generally assumed that the prospect of taking a test motivates 
pupils to study. There is no doubt that most people do some study and 
preparation, if they can, when faced with the prospect of taking a final 
examination in a course. There are other aspects of the problem of moti- 
vation, however. For example, there is the question of whether pupils do 
better on a final examination if they have had occasional tests during the 
term or semester than they would without having had such periodic tests. 
The question is complicated by various factors such as the kinds of periodic 
tests used in relation to the kind of final examination given, whether 
students are simply told their scores on the tests or are permitted to go 
over them afterwards, and whether the tests are announced in advance or 
are given without warning. 

The question of motivation is not as simple as it may first appear, and 
it is likely that many teachers and others assume the motivational value 
of tests without giving much thought to the various problems involved. 
It is a generally accepted principle of psychology that practice of a skill 
with knowledge of results, that is, of errors, successes, and over-all 
improvement, results in much more progress than practice wherein such 
information is withheld from the learner. Indeed, there is evidence that 
practice of a simple skill like drawing a straight line just two inches long 
without knowledge of the accuracy of the preceding efforts or trials does 
not bring about any real improvement in accuracy. In other words, 
practice under such conditions — far from making perfect — does not 
even result in slight improvement. It is a widely accepted fact that only 
when the learner knows of his errors and knows when he performs well does 
practice bring about noticeable improvement. 

There is considerable experimental evidence on the effect on achieve- 
ment of occasional testing as measured by success on final examinations. 
In most such experiments two groups, equal in ability and previous prep- 
aration, are formed. Both are taught by the same teacher using the same 
materials and methods. Both take the same final examination. The only 
difference is that one group, which may be called the experimental group, 
has tests at intervals throughout the semester or term while the control 
group has no such tests. Any difference in achievement as measured by 
the final examination can then be ascribed to the single variable of periodic 
tests. The results of these experiments indicate on the whole that the 
periodic testing makes no measurable difference on the final examination. 
It seems that the amount learned by students as reflected by achievement 
on final examinations is not appreciably affected by the use of periodic or 
occasional written tests. 

While many experiments have been conducted along the lines just in- 
dicated, there seem to be few reports of studies where the effects of regular 
testing on problem assignments, as in the teaching of arithmetic, have been 
investigated. One study of this nature conducted with pupils in fourth- 
grade arithmetic provides some light on this question. Two groups of 
pupils in fifty-six different classes were used as the subjects. A total of 
358 were used as the experimental group and they were matched with 358 
used as controls. The instruction of both was the same except that the 
experimental group was encouraged to compare their achievement on drill 
units with a set of standards, while the control groups were not given ac- 
cess to such standards. On the final test the experimental group made an 
average score substantially higher than that of the control group. These 
findings are in agreement with what would seem like a reasonable expecta- 
tion. When pupils can go over their test papers and find out what they 
did wrong and why, it would appear safe to assume that many of them, if 
not all, would not make the same mistakes again. 

Teachers will probably go on using tests, particularly those of their own 
devising, at least partly to stimulate pupils to greater achievement, in the 
belief that the tests function in that way regardless of the findings of ex- 
perimental studies. In this connection, it must always be remembered 
that situations and circumstances differ from one teacher and class to 
another, and that the use of tests for motivational purposes will depend 
on the individual teacher's judgment and experience of what is effective 
for him and his students. 

The use of tests for motivation is confined largely and quite naturally 
to achievement tests. There is little that an individual can or should do 
beforehand in trying to improve his score on tests of intelligence, aptitude, 
interests, or personality. On those, we are interested in motivating the 
person tested to put forth his maximum effort at the time of testing, and 
we are interested also in eliminating any coaching, study, or previous 
knowledge of the test which might give a pupil an unfair advantage and 
which might result in an inaccurate and misleading measurement of that 
pupil's ability. Achievement tests have as their basic purpose the measure- 
ment of the results of teaching, and anything that can be done legitimately 
to improve such learning is desirable. Therefore, if periodic testing serves 
to stimulate interest and motivate the pupil to greater effort and accom- 
plishment, there is justification for the use of tests for that purpose. 

14. Would you expect the effect of tests announced in advance to be the same 
as that of unannounced tests? Give reasons for your answer. 

15. Should a teacher go over standardized tests of achievement, after they have 
been scored, with pupils in order to discuss questions missed? If so, under what 
circumstances? 


IDENTIFICATION AND STUDY OF EXCEPTIONAL CHILDREN 


The majority of pupils in our schools fall within what may be thought 
of as the normal range. However, in nearly every class or group there are 
those who are outside this range in some respect. They may be exception- 
ally bright or dull; they may have more or fewer than the usual number 
of adjustment problems, or their problems may be unusually severe or 
mild; some may be exceptional in physical qualities — either unusually 
gifted or perhaps handicapped in some way that interferes with normal 
participation and success in activities of various sorts. Such children are 
referred to in educational literature as exceptional children. 

In order that suitable educational provisions may be made for exceptional 
children, it is essential to have a program of measurement. This program 
should be designed to identify, study, and diagnose, and to measure im- 
provement or change in the exceptional child. Much of this testing will be 
individual, but group tests may also be used to advantage, particularly 
in the earlier or preliminary phases of the work. Thus, group tests of 
achievement, intelligence, and personality will generally serve to identify, 
by either unusually high or low scores, certain children who may, tentatively, 
at least, be noted as exceptional. School surveys using standardized tests 
will usually reveal a number of such children who may require further study 
and special attention. Tests are also useful in diagnosing the nature and 
causes of exceptionalities. Achievement tests, for example, particularly 
diagnostic ones, may serve this purpose. 

There are some group tests for identifying physical handicaps, such as 
the 4-A Audiometer (Western Electric Co.) for testing hearing. Each child 
is provided with a set of earphones and paper and pencil. Auditory signals 
— words and numbers in varying degrees of loudness — are sounded in 
the right and left ears separately through use of a phonograph record. The 
children record what they hear. Most tests for physical handicaps or 
exceptionalities are individual tests and they are usually administered and 
interpreted by specially trained and qualified personnel (physicians and 
psychologists). Teachers should be alert to evidences of physical defects 
and poor health, for they are often the first ones to identify children who 
have exceptional physical characteristics. 

For use with physically handicapped children the typical paper-and- 
pencil test of school achievement must often be modified in some way. 
Larger print for the visually handicapped, oral or Braille instructions for 
blind children, modification of directions for the deaf or hard-of-hearing, 
and various adaptations of test procedures for use with crippled children 
represent ways in which tests have been modified for use with the 
handicapped child. Individual tests such as the Stanford-Binet have also been 
modified for use with handicapped children. 

Tests may also be very useful in measuring results of special programs. 
Usually after an exceptional child has been identified and his case diagnosed 
there is an attempt to do something about his handicap or particular gift. 
Special schools, special classes, remedial work, therapy of various kinds, 
and correction of defects by more radical means are used as each case may 
require. The amount of improvement — educational, psychological, or 
emotional — may be gauged by the use of suitable tests, either tests de- 
signed for the purpose or modifications of existing tests. 

In working with exceptional children it is only natural that the handi- 
capped should receive the most attention. A child who is blind, deaf, 
feeble-minded, or crippled naturally calls forth sympathy and assistance. 
This is as it should be. However, there is another type of exceptional 
child, the gifted child, who is often neglected and given even less attention 
than a normal child. Because a child is gifted it is easy to ignore him or 
leave him to his own devices while energy and help are directed toward the 
others. The gifted child can usually be depended upon to “come through” 
without much help from anyone. Not to give him special attention and 
encouragement is, however, a short-sighted policy. Our future leaders in 
science, industry, and public affairs will come largely from these gifted chil- 
dren, and it is to society's advantage to help them in every way to make 
the most of their unusual talents. Not to do so is to risk depriving society 
of great benefits. There is considerable evidence that schools and society 
in general are becoming increasingly aware of the importance of giving 
special attention to gifted children, and that they are now providing rich 
opportunities for these children to develop their unusual gifts to the 
fullest.¹⁸ 

The development of standardized intelligence tests and, to a lesser de- 
gree, standardized achievement tests has made it possible to identify and 
measure more accurately than ever before these gifted children in our 
schools. Personality tests have made it possible to study their personal 
and social characteristics and have demonstrated that, contrary to belief 
among some persons, children who are gifted intellectually are usually 
normal and well-adjusted, at least as much so as the general population. 
The use of tests with gifted children has opened a whole new field of study 
and research which should result in benefits of considerable importance 
to society as a whole. One example of the practical application of such 
techniques is the annual nation-wide testing of high school seniors for the 
selective service. Those who demonstrate the required level of competence 
on these tests are deferred from compulsory military service until they have 
completed college or university, provided they continue to make satisfac- 
tory progress. 


18 Elise H. Martens, Curriculum Adjustments for Gifted Children, U.S. Office of 
Education Bulletin 1946, No. 1 (Washington, D.C.: Government Printing Office, 1946); 
also, Arno Jewett, J. Dan Hull, et al., Teaching Rapid and Slow Learners in High School, 
U.S. Office of Education Bulletin 1954, No. 5 (Washington, D.C.: Government Printing 
Office, 1954). 

16. What kinds of measurements would you use to identify gifted children? 
How would you decide, on the basis of scores on each type, what your criteria for 
“gifted” would be? 

17. Make a list of various kinds of handicaps found among school children. 
Cite one or more types of measurement useful in each case. 

18. What adaptations in a paper-and-pencil test must be made for (a) the deaf, 
(b) the blind, (c) the feeble-minded, (d) spastics? 


INTERPRETING SCHOOLS TO THE COMMUNITY 


There seems to be a growing interest, on the part of the public, in the 
activities and problems of the schools. The verb “seems” is used because 
it is evident from many records and from the history of education in this 
country that parents have always taken an active interest in the schools, 
at least as long as their own children were enrolled. With improved com- 
munication and transportation the public is probably better informed 
about the schools today than at any previous time. This, of course, is 
highly desirable; the more the public knows about the schools the better 
its understanding and support is likely to be. 

Letters to parents, reports to the public through meetings, discussions, 
and the press, can explain and justify the aims of the schools and give the 
public an opportunity to react to the ideas presented. Tests and other 
kinds of evaluative instruments and techniques can also be very useful in 
explaining what the schools are trying to accomplish, and they are par- 
ticularly useful in showing how well these purposes have been realized. 

To illustrate how tests have helped to interpret the schools in one 
community, a self-survey was conducted by lay committees with the help of 
professional educators as consultants. Standardized tests of reading, 
arithmetic, and language arts were given in the primary grades and 
standardized batteries in the upper grades; also, tests in English, social studies, 
science, and arithmetic were given in the junior and senior high schools. 
The results were compared with national norms and analyzed in relation 
to local goals, ability of pupils, and other factors. The findings were 
presented in pamphlet form and discussed by parents, teachers, and 
professional educators at public meetings. As a result of this survey the 
community had a better knowledge and understanding of its schools, and most 
of the citizens were much more interested in supporting a program of 
improvement than they would have been otherwise. 

In recent years the schools have been criticized rather frequently on 
the grounds that they are neglecting fundamentals, that children are not 
learning to read, write, spell, and “cipher” as well as they used to, and 
that training in courtesy, industry, and punctuality is being neglected. 
It is obviously difficult to prove or disprove such accusations. For one 
thing, the use of standardized tests of achievement is a fairly recent de- 
velopment, and without results of such tests from earlier periods objective 
comparisons are very difficult, if not impossible. In one investigation, 
however, it was possible to repeat certain tests that were given from fifty 
to seventy-five years ago.* In about half the instances the average scores 
of today's pupils were better than those of the earlier generation, and in 
the other half they were not as good. In 1845 children generally performed 
better on questions of rote memory and abstract skills, and less well on 
thought questions, than children in 1919. On the whole, the children did 
at least as well in 1919 as those in 1845. 

As time goes on, it will be possible to make comparisons of the achieve- 
ment of pupils at any desired intervals. It is important not only that 
such studies measure the achievement of pupils, but also that the results 
be interpreted in relation to measurement of the pupils’ intelligence and 
other factors that have a bearing on their achievement. It seems fairly 
certain, for instance, that comparisons of the intelligence of today's high 
school pupils with that of similar pupils of a century ago would show the 
latter to be a more highly selected group. Only a small proportion of 
children of secondary school age had the opportunity then to attend a 
secondary school, whereas today almost all children who desire it may go 
to school until they are at least part way through high school. Indeed, 
most states require school attendance until age sixteen, and this serves to 
get most children at least into, if not through, some type of secondary 
school. Comparisons of achievement, no matter how good the tests, would 
be open to serious question unless it were possible either to assume equality 
of ability or to submit evidence that it existed. 

It seems clear, therefore, that tests and other measurement devices 
and techniques can be highly useful in interpreting the schools to the 
community. As more adequate measures of the great variety of objectives 
of our educational programs are developed, they should increasingly serve 
to bring about better rapport and understanding between teachers, pupils, 
and parents. The respect of pupil and parent for the teacher and for the 
importance and complexity of his task will grow. At the same time, the 
teacher should be able to do a better job because the results of tests will 
give him a better insight into the nature of his pupils and will assist him 
in explaining what his purposes and his successes are. 


* Otis W. Caldwell and S. A. Courtis, Then and Now in Education 
(Yonkers-on-Hudson, N.Y.: World Book Company, 1925), Chap. 7. 


Learning Exercises 


19. Assume that you are responsible for public relations between the schools 
and the public in a community of 15,000. How would you use the results of 
measurement to interpret your program to the community? 

20. If you had to deal with a parent who took the attitude that all tests are bad, 
how would you proceed to change his viewpoint? 


IMPROVEMENT OF SCHOOL STAFF 


The use of tests and other measuring techniques can contribute sub- 
stantially to the professional development of teachers in several ways. 
In the first place, putting on a measurement program can result in pro- 
fessional growth through cooperative planning, organizing, and conducting 
of such a project. In order to participate actively, teachers must learn 
something about available tests, the characteristics of good measuring in- 
struments, methods of determining whether a test is a good one or not, 
and sources of information which will provide a basis for making such a 
determination. Some teachers must also learn how to administer, score, 
and interpret tests, and how to put the results to good use. All of these 
experiences and skills can come out of participation by teachers in a meas- 
urement program. 

Another way in which measurement contributes to staff improvement 
is through the construction of tests and other measures for local use. In 
this activity a teacher must identify objectives of instruction and try to 
construct instruments which will measure progress toward them. This 
will direct attention not only to the objectives, but also to methods and 
materials for attaining them. It will also bring the pupil into the picture 
since, in constructing tests, the teacher will constantly be thinking of ways 
in which to measure pupil changes resulting from instruction. 

After the results of measurement are known, these will promote pro- 
fessional growth in the teacher by revealing what has worked well in his 
instruction and what has not. An appraisal of the apparent effectiveness 
of his methods and materials will cause him to examine these with new 
insight. Measurement programs may also help teachers to learn from each 
other by revealing individual strengths and weaknesses and by encouraging 
them to exchange ideas much in the same manner as housewives improve 
their cooking by exchanging their best recipes with one another. 

Educational measurements may also contribute to professional growth 
of school staff by giving the teacher better insight into the individual pupil’s 
capacities, interests, achievements, personality problems, and needs. Such 
improvement in understanding will almost certainly make the teacher more 
useful and effective as a teacher and as a friend and adviser. 

Knowledge and use of measuring procedures may contribute to pro- 
fessional growth of the school staff through the development of a better 
understanding of the problems involved in accurate measurement of human 
traits and greater appreciation of the efforts of pioneers in this area. Also, 
such study and investigation by teachers should help to bring about im- 
provement in their own tests and measuring instruments. The better the 
evaluative procedures used by teachers and counselors, the more effective 
will be their teaching and counseling. 

Finally, the use of measuring instruments should be helpful to the ad- 
ministrator and supervisor in many ways. Tests may be used to help select 
personnel for teaching and other positions. Tests of pupil achievement, 
observations, and rating scales will be useful to the supervisor in the in- 
service education of teachers. Various measuring and evaluative devices 
such as check lists and score cards may also be useful in arriving at sounder 
judgments concerning physical facilities. 

21. What are some ways in which a teacher may grow as a result of participation 
in planning and carrying out a measurement program? 

22. Should scores on achievement tests of pupils be used by supervisors in as- 
sisting teachers to improve their methods? If so, in what ways might this be done? 

23. Of what benefit to teachers might self-rating be? 


EDUCATIONAL RESEARCH 


Although research is not generally thought of as one of the major func- 
tions of the average school or school system, it is true that in a modest sense 
most schools perform research of one kind or another. In many of these 
research activities tests play an important role. If comparisons are desired 
between grades, schools, or systems, tests provide an objective, reliable 
basis for making them. In fact, standardized tests are the only type which 
permits local comparisons and also comparisons with national norms. 

In school surveys tests are useful tools for studying such problems as 
the grade placement of pupils, achievement in basic fundamentals, the 
relationship of the offerings of the school to the needs of the community, 
and the degree of success attained in realizing the educational goals of the 
school or of the community. Measuring instruments are not available yet 
for all of the educational goals a school or community may set for itself, 
but the use of most existing instruments is comparatively simple. This is 
true, for example, of educational goals expressed in terms of subject mat- 
ter and, to a lesser extent, of such goals as attitudes, desirable habits of 
work and study, and participation in school and community activities. 
In other areas such as social adjustment or citizenship it may be necessary 
to devise original measures. This in itself is a worthwhile type of research 
activity for teachers, particularly if they have the expert advice of spe- 
cialists. 

Tests may also serve research purposes in the schools in conjunction 
with comparative studies of different methods of teaching. Most teachers 
are keenly interested in finding the most effective ways of doing things, 
not only because of the improved efficiency and consequent saving of time 
and energy, but also because better methods result in better learning or 
achievement by the pupil. A common type of educational research is to 
form two equated or equivalent classes or sections and to compare the 
relative effectiveness of two methods of instruction, one class being taught 
by one method, the second by the other method. If the two groups are 
equal at the beginning, then any difference at the end may be ascribed to 
the differences in method, provided, of course, that all other factors that 
might affect the results are held constant. In all such experiments measure- 
ment plays an important part. It is used to measure and equate the status 
of groups before the experiment is begun, and to measure the results after 
it is completed. 


Learning Exercises 


24. List three examples of research in which classroom teachers might engage 
and which would require the use of measurement. 

25. Would research interests and problems of counselors differ from those of 
classroom teachers? If so, give some examples of possible interests and problems 
of each group and indicate what kinds of tests they would utilize. 

26. The larger school system generally has a bureau of research. What 
measurement functions and activities would such an organization perform? 


Further Statistical Computations 


The problems in this appendix utilize as far as possible the data pre- 
sented in Chapter 3. The chief purpose of this section is to supplement 
Chapter 3 by providing an opportunity for the student to learn and prac- 
tice the actual steps in computing the usual statistical measures, and by 
giving him a deeper insight into the meaning and significance of these 
measures in interpreting the results of measurement. 


FREQUENCY DISTRIBUTION 


Ordinarily, a teacher works with classes or groups numbering between 
25 and 40 pupils. In most such cases the scores can be handled individually 
without any special arrangement or grouping. However, it is often ad- 
vantageous to arrange the scores in some systematic order or grouping; 
where the number of scores or cases is large, perhaps 50 or more, such 
grouping is practically a necessity. 

A frequency distribution is merely a method of arranging scores into 
groups, or class intervals, as they are generally called, for ease in handling 
the figures. Since most statistical work is done with scores arranged in 
such a frequency table or distribution, it is helpful to be able to construct 
or read one. 

On the arithmetic test cited in Chapter 3 (page 35), the scores of John's 
class were as follows: 44, 21, 14, 18, 46, 45, 52, 30, 39, 36, 31, 22, 23, 38, 
33, 33, 29, 38, 32, 29, 42, 28, 26, 33, 25. These were arranged in order from 
the highest to the lowest, producing this sequence: 52, 46, 45, 44, 42, 39, 
38, 38, 36, 33, 33, 33, 32, 31, 30, 29, 29, 28, 26, 25, 23, 22, 21, 18, 14. 

John's score of 36, as we know, was ninth in the class; the arithmetic 
mean was 32.3 and the mid-score was 32. These results are easily obtained 
without any further rearrangement of the scores. However, we can make 
a frequency table from these scores by the following steps: 

1. Choose some convenient class interval, say five. If the range of 
scores is small (the range is the difference between the highest and lowest 
scores), use a smaller class interval; if the range is large, use a larger 
interval, perhaps ten. In any case, use an interval large enough so that the 
table will not be too long for convenience in working with it and yet small 
enough to represent the scores with reasonable accuracy. In this case we 
have used an interval of five. 

2. Make a table of class intervals that will serve to include all scores 
in the class or group. In general, the class interval used should be of such 
a size as to give a distribution containing not less than eight nor more than 
sixteen intervals. By using the interval of five in the example below, we 
establish nine such categories. 

3. Tally the scores one by one in the proper class intervals. 

4. Add the tallies and write the sum in each interval. These are called 
frequencies. The total of all of the frequencies (N) gives the number of 
students tested. 


Table VIII 


Frequency Distribution 


Class Intervals    Tallies    Frequencies 

50-54              |               1 
45-49              ||              2 
40-44              ||              2 
35-39              ||||            4 
30-34              ||||| |         6 
25-29              |||||           5 
20-24              |||             3 
15-19              |               1 
10-14              |               1 

                               N = 25 
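
The four steps above can also be carried out mechanically. The following is a minimal Python sketch, assuming (as in Table VIII) that the lower score limits are multiples of the class interval; the function name `frequency_table` is illustrative and does not come from the text.

```python
from collections import Counter

# Arithmetic scores of John's class, as listed in the text.
scores = [44, 21, 14, 18, 46, 45, 52, 30, 39, 36, 31, 22, 23, 38,
          33, 33, 29, 38, 32, 29, 42, 28, 26, 33, 25]

WIDTH = 5  # class interval chosen in step 1


def frequency_table(values, width):
    """Return (lower, upper, frequency) rows, highest interval first."""
    # Step 3: tally each score under the lower score limit of its interval.
    counts = Counter((v // width) * width for v in values)
    lowest, highest = min(counts), max(counts)
    rows = []
    # Step 2: list every interval from the highest down to the lowest.
    for lower in range(highest, lowest - 1, -width):
        rows.append((lower, lower + width - 1, counts.get(lower, 0)))
    return rows


table = frequency_table(scores, WIDTH)
for lower, upper, f in table:
    print(f"{lower}-{upper}  {'|' * f:<8} {f}")

# Step 4: the frequencies must add up to the number of pupils tested.
print("N =", sum(f for _, _, f in table))
```

The sum of the frequency column serves as the rough check on N mentioned in step 4.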

Although we have gone through the essential steps in setting up a frequency 
table, we have not considered adequately some basic questions 
underlying this method. To use the method correctly and intelligently 
these questions must be considered and answers agreed upon. The first of 
these concerns the class interval. 


Limits of Class Intervals 


A score on a test is usually a whole number, such as 22. Generally we 
do not deal with fractional scores in educational measurement. However, 
it is necessary to give some consideration to the actual value of a whole 
number. For example, we can consider the score of 22 as representing a 
range of all possible values from exactly 22.000 up to but not including 
23.000. In this case the score of 22 should really be written 22.500, since 
that would be the most probable value if we measured accurately to three 
decimals. 

On the other hand, we can consider the score of 22 as representing a 
range of all possible values from 21.500 up to but not including 22.500. In 
this case the score of 22 is taken to mean 22.000, which is the most probable 
value. The latter concept is the one most generally favored in statistical 
work. 

The use of three decimal places here is arbitrary. Exactly 22 would mean 
22 followed by an infinite number of zeros; exactly 22½ would mean 22.5 
followed by an infinite number of zeros. When we write 22.00 or 22.000 
we assume the rest of the zeros if we mean exactly 22. The same is true in 
the case of 22.50 or 22.500. 

The same principles apply in the interpretation of the limits of class 
intervals. If we are dealing with a class interval of five we may indicate 
this in several different ways: 


25-30          25-29          24.5-29.5 
         or             or 
20-25          20-24          19.5-24.5 


The first way is the least desirable since the limit 25 appears in two 
successive intervals and may thus lead to errors in tabulation; the second has 
the advantage of simplicity over the third, and does not have the obvious 
fault of the first; the third, while the most exact in statement, is cumbersome. 
The second method is therefore recommended with the admonition 
that the value of a score be remembered as explained above. Then the 
interval 20-24 really means from 19.5 to 24.5 (19.500-24.500). This 
constitutes an interval of five which contains all whole number scores of 20, 
21, 22, 23, and 24, or any score from 19.5 up to, but not including, 24.5. 


Mid-point of Intervals 


In statistical work with frequency distributions it is often necessary to 
use the mid-point of an interval. There are two steps in determining the 
mid-point: 

1. Find one-half the class interval. 
2. Add this to the actual lower limit or subtract it from the actual upper 
limit of the interval whose mid-point is desired. 

Let us take as an example the interval 20-24. The interval is, of course, 
five, and so halfway through it would be 2.5. Then if we begin at the 
upper (24.5) or lower (19.5) limit of the interval and go 2.5 steps or score 
points we get 19.5 + 2.5 = 22.0, or 24.5 - 2.5 = 22.0. This must be the 
mid-point since it is equidistant from the upper limit and the lower limit 
of the class interval. 
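
The two steps can be written out as a short Python sketch. It assumes whole-number score limits, so that the actual limits lie one-half point beyond the written ones; the helper names `real_limits` and `midpoint` are illustrative only.

```python
def real_limits(lower, upper):
    """Actual limits of an interval written with whole-number score limits."""
    return lower - 0.5, upper + 0.5


def midpoint(lower, upper):
    """Mid-point found by the two steps given in the text."""
    low, high = real_limits(lower, upper)
    half = (high - low) / 2      # step 1: one-half the class interval
    from_below = low + half      # step 2: add to the actual lower limit ...
    from_above = high - half     # ... or subtract from the actual upper limit
    assert from_below == from_above
    return from_below


print(real_limits(20, 24))   # actual limits of the interval 20-24
print(midpoint(20, 24))      # its mid-point
```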


Learning Exercise 


1. Find the mid-points of the following class intervals: (a) 50-59; (b) 27-29; 
(c) 30-35; (d) 13-14; (e) 96-101. (59 means up to, but not including, 60; 29 
means up to, but not including, 30, etc.) 


One basic assumption is made in working with mid-points of class 
intervals. The interval 20-24 has a mid-point of 22; if nine cases fall in this 
interval, we assume that these nine scores are evenly distributed throughout 
the interval, or, more generally, we assume that the average of these 
nine scores is equal to the mid-point of the interval. 

What we have said concerning class intervals can be presented graphi- 
cally as follows: 


                          x 
                x         x         x 
      x         x         x         x         x 

19.5 20.0 20.5 21.0 21.5 22.0 22.5 23.0 23.5 24.0 24.5 


The actual or real limits of the interval are shown at the ends of the line, 
the mid-point, 22, at the center; the nine scores in the interval balance so 
that any average of them would give 22. Any distribution of the nine 
scores which gives an average equal to the mid-point of the interval will 
satisfy the assumption. In actual practice this assumption tends to be 
reasonably well met. This is particularly true when the number of cases 
is large and the class interval chosen is fairly small. The larger the class 
interval and the smaller the frequencies, the greater are the chances of 
introducing error. It is also likely that error introduced as a result of cases 
piling up at one end of a particular interval will be balanced by an opposite 
tendency in another interval, the two sources of error thus tending to 
balance or neutralize each other. 


Making a Frequency Distribution 


A simple frequency distribution was shown on page 388, with a brief 
statement of the steps involved in making it. Let us now take a series of 
scores and carefully work through the steps required to make a frequency 
distribution of them. The following scores represent actual scores of 49 
pupils on a general science test: 33, 42, 47, 61, 43, 52, 71, 21, 43, 37, 60, 43, 
54, 68, 13, 50, 38, 40, 67, 3, 45, 47, 49, 58, 38, 46, 58, 36, 44, 55, 15, 38, 44, 
40, 28, 27, 44, 36, 41, 39, 22, 36, 18, 41, 24, 32, 8, 33, 25. 

1. Choose a class interval of suitable size. This is usually done by (a) 
finding the range of scores (here it is 71 - 3 = 68), and (b) dividing the 
range by a convenient class interval to see if it gives a number of intervals 
between 8 and 16. In this case: 


68 ÷ 4 = 17 
68 ÷ 5 = 13+ 
68 ÷ 6 = 11+ 
68 ÷ 7 = 9+ 


In practice we seldom use class intervals of 4, 6, or 7. The most commonly 
used intervals are 2, 3, 5, 10, or, if necessary, 20. Here we shall use 5 as our 
class interval since it is of a convenient size and it actually gives fifteen 
class intervals. 

2. Next, set up a frequency table designating the class intervals. Note 
that we have chosen the lower limits of our intervals in such a way that 
they are multiples of the interval size, that is, of five. We use these limits 
here for the sake of convenience, but it would be just as satisfactory 
statistically to use limits which are not multiples of the interval. Some 
authorities recommend choosing the limits in such a way that the 
mid-points of the intervals are multiples of the interval, but this seems a less 
natural way of thinking about class intervals. The two methods are 
illustrated: 


A. LIMITS ARE MULTIPLES          B. MID-POINTS ARE MULTIPLES 
   OF THE INTERVAL                  OF THE INTERVAL 

Limits     Mid-points            Limits     Mid-points 
25-29      27                    23-27      25 
20-24      22                    18-22      20 
15-19      17                    13-17      15 


Since our scores range from 3 to 71 we will need a series of intervals that 
will include these extremes and all possible scores in between. Using 
system A, above, we have the series of intervals given below, showing intervals 
for the complete range of scores from 3 to 71 and all possible scores in 
between. 

3. Tally the actual scores in the proper class intervals, adding the 
tallies in each class interval. These are the frequencies. 


CLASS INTERVALS (c.i.)    TALLIES          FREQUENCIES (f) 

70-74                     |                       1 
65-69                     ||                      2 
60-64                     ||                      2 
55-59                     |||                     3 
50-54                     |||                     3 
45-49                     |||||                   5 
40-44                     ||||| ||||| |          11 
35-39                     ||||| |||               8 
30-34                     |||                     3 
25-29                     |||                     3 
20-24                     |||                     3 
15-19                     ||                      2 
10-14                     |                       1 
5-9                       |                       1 
0-4                       |                       1 

                                             N = 49 

The sum of the frequencies (N) should be equal to the number of scores 
in the group. This is a rough check on the accuracy of the tabulation. 
However, the most common error in doing this task is the tabulation of a 
score in the wrong class interval. It is easy to make this mistake, and the 
only way to detect such errors is to tabulate the scores twice and see whether 
the frequencies in each class interval check. The second tabulation may 
be done alongside the first one or by placing a dot over each tally when 
going through the second time, thus: 35-39 ||||| ||| (with a dot over each tally). 
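
The choice of interval and the double tabulation can be sketched in Python as follows. This is only an illustration under stated assumptions: lower limits are taken as multiples of the interval (system A), the candidate widths are the commonly used ones named in the text, and the function names are hypothetical.

```python
# The 49 general science scores listed in the text.
science = [33, 42, 47, 61, 43, 52, 71, 21, 43, 37, 60, 43, 54, 68, 13, 50,
           38, 40, 67, 3, 45, 47, 49, 58, 38, 46, 58, 36, 44, 55, 15, 38, 44,
           40, 28, 27, 44, 36, 41, 39, 22, 36, 18, 41, 24, 32, 8, 33, 25]

COMMON_WIDTHS = (2, 3, 5, 10, 20)


def interval_count(values, width):
    """Number of intervals needed when lower limits are multiples of width."""
    return max(values) // width - min(values) // width + 1


# Step 1: keep the widths that give between 8 and 16 intervals.
suitable = [w for w in COMMON_WIDTHS if 8 <= interval_count(science, w) <= 16]
width = 5  # the width adopted in the text


def tabulate(values, width):
    """Steps 2 and 3: frequencies per interval, lowest interval first."""
    lowest = (min(values) // width) * width
    freqs = [0] * interval_count(values, width)
    for v in values:
        freqs[(v - lowest) // width] += 1
    return lowest, freqs


lowest, freqs = tabulate(science, width)

# Second, independent tabulation: count the scores interval by interval.
recount = [sum(1 for v in science if lo <= v <= lo + width - 1)
           for lo in range(lowest, lowest + width * len(freqs), width)]

print("range:", max(science) - min(science))
print("suitable widths:", suitable)
print("frequencies (lowest interval first):", freqs)
print("tabulations agree:", recount == freqs, "| N =", sum(freqs))
```

Agreement of the two counts corresponds to the check by double tabulation; agreement of the total with 49 is only the rough check described above.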

Learning Exercises 

2. Make a new frequency table using the 49 scores on page 391, but with a class 
interval of three. 
3. Make a frequency table of the reading test scores given in Chapter 3, page 42. 


MEASURES OF CENTRAL TENDENCY 


Mean 


In Chapter 3 the mean and median were discussed and calculated with 
25 ungrouped scores. Let us now see how the mean and median are 
determined from a frequency distribution of the 49 scores on the science test 
on the preceding page. 


Table IX 


Calculation of Mean Using 49 Scores on Science Test 


c.i.       f      d      fd 

70-74      1     +6      +6 
65-69      2     +5     +10 
60-64      2     +4      +8 
55-59      3     +3      +9 
50-54      3     +2      +6 
45-49      5     +1      +5     (sum of +fd's = +44) 
40-44     11      0       0 
35-39      8     -1      -8 
30-34      3     -2      -6 
25-29      3     -3      -9 
20-24      3     -4     -12 
15-19      2     -5     -10 
10-14      1     -6      -6 
5-9        1     -7      -7 
0-4        1     -8      -8     (sum of -fd's = -66) 

       N = 49         Σfd = -22 


$$M = A.M. + \left(\frac{\Sigma fd}{N} \times c.i.\right)$$

M = Mean 
A.M. = Assumed mean 
Σfd = Algebraic sum of deviations about assumed mean 
N = Number of scores 
c.i. = Class interval 

$$M = 42.0 + \left(\frac{-22}{49} \times 5\right) = 42.0 + (-2.2) = 42.0 - 2.2 = 39.8$$


The steps illustrated in Table IX are as follows: 

1. Select an arbitrary origin. It is generally best to select a point near 
the center of the distribution, although any interval may be used without 
affecting the result. Here we chose the interval 40-44 whose mid-point, 
42, we call the assumed mean. 

2. In the d column mark off steps by intervals above (+) and below (-) 
the A.M. (d = deviations from the interval containing the assumed mean, 
in units of the class interval). 

3. Multiply these steps by the frequencies in the respective intervals 
and enter these products in the fd column. 

4. Add the positive fd's (44) and the negative fd's (-66) separately. 
Algebraically add the +fd's and -fd's to find Σfd. The Σfd divided by N 
gives the correction, that is, the amount expressed in units of the class 
interval by which our assumed mean differs from the actual value or mean. 

5. The formula $M = A.M. + \left(\frac{\Sigma fd}{N} \times c.i.\right)$ simply converts the 
correction from units of the class interval to score units and applies this correction 
to the assumed mean, giving us the corrected value for the mean. 
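
The five steps can be traced in a short Python sketch. It assumes the frequency distribution of the 49 science scores tabulated earlier, with lower score limits that are multiples of five; the function name `mean_from_assumed` is illustrative.

```python
WIDTH = 5
lower_limits = list(range(0, 75, WIDTH))                  # 0, 5, ..., 70
freqs = [1, 1, 1, 2, 3, 3, 3, 8, 11, 5, 3, 3, 2, 2, 1]    # lowest interval first


def mean_from_assumed(lowers, freqs, width, assumed_lower):
    """Mean of grouped data by the assumed-mean (short) method."""
    n = sum(freqs)
    # Step 1: the assumed mean is the mid-point of the chosen interval.
    assumed_mean = assumed_lower + (width - 1) / 2
    # Steps 2-4: d is the number of intervals above (+) or below (-) the A.M.
    sum_fd = sum(f * ((lower - assumed_lower) // width)
                 for lower, f in zip(lowers, freqs))
    # Step 5: convert the correction to score units and apply it.
    return assumed_mean + sum_fd / n * width, sum_fd


mean, sum_fd = mean_from_assumed(lower_limits, freqs, WIDTH, assumed_lower=40)
print("sum of fd:", sum_fd)
print("mean:", round(mean, 1))
```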

Learning Exercises 

4. Check the value 39.8 (Table IX) by adding the 49 scores and dividing by 49. 
Is there a difference? Why? 

5. Check this value further by assuming the mean to be in some interval other 
than 40-44, and re-calculating the mean. Does it agree with the value already 
obtained, namely, 39.8? 

6. Calculate the means of the scores on the arithmetic test and the reading test 
using the frequency distributions shown in Table VIII, page 388, and prepared in 
Learning Exercise 3, page 392. 


Mid-score and Approximate Semi-interquartile Range 


Where data are not grouped into class intervals of a frequency distribu- 
tion it is often sufficiently accurate to use the middle score of the series as 
the average, and approximate values of the third and first quartiles in 
finding the semi-interquartile range (Q) as a measure of variability. When 
more exact results are required, and especially when frequency distributions 
are used as a basis for calculation, the mean (and median) and standard 
deviation should be calculated. 

In Table X are shown methods for obtaining approximate medians and 
Q's from ungrouped data. 


Table X 


Determination of Mid-score and Semi-interquartile Range 
Using Ungrouped Data: 25 Arithmetic Test Scores 
and 25 Reading Test Scores 


Arithmetic                                                    Reading 
Scores                                                        Scores 

52                                                            111 
46                                                            102 
45                                                             94 
44                                                             92 
42                                                             91 
39                                                             87 
38   <-- Q3 (approx.): nearest whole number above        -->   86 
38       which 25% of cases (N/4 = 25/4 = 6.25) lie            81 
36                                                             80 
33                                                             77 
33                                                             77 
33                                                             76 
32   <-- Mid-score                                       -->   75 
31                                                             73 
30                                                             72 
29                                                             70 
29                                                             69 
28                                                             68 
26   <-- Q1 (approx.): nearest whole number below        -->   66 
25       which 25% of cases lie                                65 
23                                                             62 
22                                                             59 
21                                                             56 
18                                                             48 
14                                                             46 


Median 


The median differs from the mid-score in that it is always a calculated 
value. It is defined as a theoretical point above and below which exactly 
50 per cent of the cases lie. The mid-score, on the other hand, may be an 
actual score with something less than 50 per cent of the cases on one side 
or the other. This is always so when the number of cases is odd. When 
it is even, the mid-score is a point between the upper and lower half of the 
scores. For example, if one had six papers scored 80, 70, 60, 50, 40, and 30, 
the mid-score would be halfway between 50 and 60, or 55. 

In Table XI (page 396) are shown the details of the method for 
calculating the median from a frequency distribution. The steps shown in this 
table are the following: 

1. Divide the total number of cases by two. This gives the half sum or 
the number of cases above and below the middle of the distribution. In 
this example N/2 = 24.5. 

2. By inspection determine in which class interval this point will be. 
Since there are only 22 cases below the interval 40-44, and since the 
number of scores below the next interval, 45-49, is 33, we know the 24.5 point 
must be somewhere in the interval 40-44. 

3. Subtract the number of cases below this interval as shown in the 
cumulative frequency column from 24.5. This gives us the number of 
cases needed out of the interval in which the median falls. In this instance 
it is 24.5 - 22 = 2.5 cases. 

4. Divide this number by the total number of cases in the interval. 
Here, this is 11. 2.5 ÷ 11 = .22. Thus we determine how far through the 
interval we must go to get the proportion of the cases needed. (Remember 
the assumption mentioned earlier that all scores in an interval are evenly 
distributed throughout the interval.) 

5. Multiply this ratio by the size of the class interval to ascertain score 
points for this proportion of the interval. In our problem this gives 1.1 
score points. 

6. Add this to the lower limit of the interval containing the median. 
This gives the median: that point below and above which an equal number 
of cases or scores fall. As in the example, it is usually a theoretical point 
rather than an actual score, due to the method used in finding it. 


Table XI 


Calculation of the Median from a Frequency Distribution 
of 49 Science Test Scores 


c.i.       f     cum. f 

70-74      1      49 
65-69      2      48 
60-64      2      46 
55-59      3      44 
50-54      3      41 
45-49      5      38 
40-44     11      33     <-- Median falls in this interval 
35-39      8      22 
30-34      3      14 
25-29      3      11 
20-24      3       8 
15-19      2       5 
10-14      1       3 
5-9        1       2 
0-4        1       1 

       N = 49 


$$\text{Median} = l.l. + \left(\frac{\frac{N}{2} - F}{f_m}\right) \times c.i.$$

l.l. = lower limit of class interval within which median falls 
N/2 = one-half of the scores 
F = sum of all scores below l.l. 
fm = number of scores within interval in which median falls 
c.i. = size of class interval 

$$\text{Median} = 39.5 + \left(\frac{24.5 - 22}{11}\right) \times 5 = 39.5 + 1.1 = 40.6$$


(Adapted from Henry E. Garrett, Statistics in Psychology and Education, 
Fourth Edition; New York.) 
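
The six steps for the median can be followed in code. The sketch below is a minimal Python version, assuming the same frequency distribution as Table XI and whole-number score limits (so that the actual lower limit is one-half point below the written one); `grouped_median` is an illustrative name.

```python
WIDTH = 5
lower_limits = list(range(0, 75, WIDTH))                  # 0, 5, ..., 70
freqs = [1, 1, 1, 2, 3, 3, 3, 8, 11, 5, 3, 3, 2, 2, 1]    # lowest interval first


def grouped_median(lowers, freqs, width):
    """Median of a frequency distribution by interpolation within an interval."""
    half = sum(freqs) / 2                     # step 1: N/2
    below = 0                                 # cumulative frequency below l.l.
    for lower, f in zip(lowers, freqs):
        if f > 0 and below + f >= half:       # step 2: interval holding the point
            needed = half - below             # step 3: cases needed from it
            ratio = needed / f                # step 4: proportion of the interval
            return (lower - 0.5) + ratio * width   # steps 5 and 6
        below += f
    raise ValueError("empty distribution")


median = grouped_median(lower_limits, freqs, WIDTH)
print("median:", round(median, 1))
```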

Learning Exercise 

7. Using the frequency distributions of scores in arithmetic and reading shown 
in Table VIII, page 388, and prepared in Learning Exercise 3, page 392, calculate 
the median in each case. 


The Mode 


There is one other measure of central tendency which should be 
mentioned. It is a crude, inspectional average called the mode. This may be 
defined as the score occurring with greatest frequency. In a frequency 
table the mode is the mid-point of the interval having the largest 
frequency. In Table XI it is 42. 

The mode is of little importance, statistically speaking, as a measure of 
central tendency. 


Comparison of Mean and Median 


As pointed out in Chapter 3, there are some important differences 
between the mean and the median. The mean is a weighted average in that it 
is affected by the actual amount or size of every score in the distribution. 
The median, however, is a counting average, or average of position. It is 
not so affected by the size of extreme scores. This was illustrated in 
Chapter 3, on page 38. In a symmetrical distribution¹ the mean and median 
are identical. Therefore, the degree of asymmetry or lack of symmetricalness 
of a distribution can be gauged by the extent to which the mean and 
median differ. 

¹ A symmetrical distribution is one in which the frequencies on each side of the average 
are the same. Generally these frequencies gradually increase from both ends to the 
middle. The distribution of scores on the science test (page 398) is roughly symmetrical. 

For most situations in which an easily calculated measure of central 
tendency is all that is needed, the median or even the mid-score serves 
the purpose. On the other hand, if careful statistical analysis is planned 
it is well to calculate the mean, or both the mean and the median. 


MEASURES OF VARIABILITY 


The importance of measures of variability in describing a series of scores 
has already been discussed in Chapter 3. It will be the purpose here to 
supplement that discussion by showing how to calculate the two most 
commonly used measures of variability or dispersion using the frequency 
distribution of 49 science test scores already presented. 


Range 

The range is a rough measure of variability. It is simply the difference 
between the highest and lowest scores in a distribution. It has been used 
earlier in determining the size of class interval. Since the range is based on 
only two scores it is not a very stable measure and is little used in statistical 
work. 


Semi-interquartile Range 

This measure is quite common in educational statistics. It is obtained 
by taking one-half the difference between the 75th percentile (Q3) and the 
25th percentile (Q1). This can be expressed by the formula $Q = \frac{Q_3 - Q_1}{2}$. 
We have already learned how to calculate the necessary values in 
connection with our calculations of the mid-score and Q in Table X, and the 
median in Table XI. Using the same method we can calculate Q3 and Q1, 
the 75th and 25th percentiles, respectively. This is illustrated in Table 
XII, which follows. 


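Table XII 


Calculation of the Semi-interquartile Range from a Frequency Distribution 
of 49 Science Test Scores 


c.i.       f     cum. f 

70-74      1      49 
65-69      2      48 
60-64      2      46 
55-59      3      44 
50-54      3      41 
45-49      5      38     <-- Third Quartile: Q3 = 75th Percentile in this interval 
40-44     11      33 
35-39      8      22 
30-34      3      14     <-- First Quartile: Q1 = 25th Percentile in this interval 
25-29      3      11 
20-24      3       8 
15-19      2       5 
10-14      1       3 
5-9        1       2 
0-4        1       1 

       N = 49 


l.l. = lower limit of class interval within which quartile point falls 
F = sum of all scores below l.l. 
fm = number of scores within interval in which quartile point falls 

$$Q_3 = l.l. + \left(\frac{\frac{3N}{4} - F}{f_m}\right) \times c.i. = 44.5 + \left(\frac{36.75 - 33}{5}\right) \times 5 = 48.25$$

$$Q_1 = l.l. + \left(\frac{\frac{N}{4} - F}{f_m}\right) \times c.i. = 29.5 + \left(\frac{12.25 - 11}{3}\right) \times 5 = 31.58$$

$$Q = \frac{Q_3 - Q_1}{2} = \frac{48.25 - 31.58}{2} = 8.33$$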
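
The quartile points are found by the same interpolation as the median, with N/4 and 3N/4 in place of N/2. A minimal Python sketch follows, assuming the frequency distribution of Table XII; the helper name `score_at_count` is illustrative.

```python
WIDTH = 5
lower_limits = list(range(0, 75, WIDTH))                  # 0, 5, ..., 70
freqs = [1, 1, 1, 2, 3, 3, 3, 8, 11, 5, 3, 3, 2, 2, 1]    # lowest interval first


def score_at_count(cases, lowers, freqs, width):
    """Score point below which the given number of cases lies."""
    below = 0                                  # F: cases below the actual l.l.
    for lower, f in zip(lowers, freqs):
        if f > 0 and below + f >= cases:
            return (lower - 0.5) + (cases - below) / f * width
        below += f
    raise ValueError("count exceeds N")


n = sum(freqs)
q1 = score_at_count(n / 4, lower_limits, freqs, WIDTH)       # 25th percentile
q3 = score_at_count(3 * n / 4, lower_limits, freqs, WIDTH)   # 75th percentile
q = (q3 - q1) / 2                                            # semi-interquartile range

print("Q1:", round(q1, 2), " Q3:", round(q3, 2), " Q:", round(q, 2))
```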


One may ask why the formula for Q calls for one-half the range between 
Q3 and Q1. This is because a true measure of variability is based on the 
deviations of scores from some measure of central tendency and is 
expressed as a distance on either side of that measure. To express the 
interquartile range in somewhat similar terms it is halved. Although the 
semi-interquartile range is not based on deviations of individual scores from an 
average, and is therefore not a true measure of variability, it is roughly 
comparable to such measures. In a symmetrical distribution Q may be 
added to and subtracted from the average, and it will include the middle 
50 per cent of the cases. In an asymmetrical distribution there is usually 
some variation from this proportion. 

When a series of scores is quite homogeneous, Q will be smaller than 
when the differences between scores or individuals are greater. This 
principle is illustrated in Figure 18. 


Figure 18 


Comparison of Q Values of Two Distributions 
Differing in Spread 


In a the spread of scores is greater than it is in b. Consequently, the 
Q3's and Q1's are farther apart and the Q is larger in a than in b. Putting 
the matter in another way, it can be said that it is necessary to take in a 
wider range or variation of scores in a than in b to include the middle half 
(50 per cent) of the cases in each distribution. 


Standard Deviation 

If we take each score in a class or series separately, find the difference 
between it and the mean, and add these differences without regard to sign, 
our result will be a number which gives some measure of the extent to 
which all the scores tend to vary from the mean. Obviously, if all the 
scores are the same, all of the differences between the scores and the mean 
will be zero, and the variability will also be zero. The larger the sum of 
these deviations from the mean, the greater the diversity or variability. 
This is the principle upon which the calculation of most measures of 
variability or dispersion is based. 

In calculating the standard deviation (σ), we square the deviation of 
each score from the mean. This has the effect of eliminating minus signs 
from the deviations of scores below the mean and it gives the standard 
deviation more stability as a measure of variability than any similar 
measure. We further divide the sum of the squared deviations by the number 
of cases or scores, which gives us the mean of these squares. Finally, we 
extract the square root of this mean. The result is called the standard 
deviation. 

In Table XIII are shown the steps by which the means and standard 
deviations of the arithmetic scores and the reading scores used in Chapter 
3 (Table II, page 44) were calculated. For the sake of simplicity, the 
results were used there without a demonstration of how they had been 
obtained. Each step of the work is shown in Table XIII, which can probably 
be interpreted without further explanation. 

Once the calculation of the standard deviation from ungrouped data is 
understood, we may proceed to Table XIV (page 402). That table 
demonstrates the method used for calculating σ from grouped data, that is, from 
a frequency distribution. Fundamentally, the process is the same as for 
ungrouped data, but corrections must be made for the use of an assumed 
mean and for the use of class intervals. Careful study of this table and 
practice with the exercises following the explanation should make these 
differences in procedure meaningful. 

The steps in calculating the standard deviation are as follows: 

1, 2, 3, and 4 should be familiar. They are the same as described in 
calculating the mean, page 393. 

5. Multiply each entry in the fd column by its corresponding d. This 
gives the fd² values. Enter these in the fd² column. 

6. Add all the fd² entries to get Σfd². 

7. Substitute the proper values for each expression in the formula and 
solve for σ, the standard deviation. 

It should be noted that Σfd/N is the correction which was used in 
calculating the mean from an assumed origin. Since we follow the same procedure 
here, it is necessary again to make the same correction, but since it is under 
the radical with the Σfd²/N it too is squared. This correction is often 
designated by the small letter c. 

As explained in Chapter 3, the standard deviation is that distance 
which, laid off above and below the mean, will include the middle 68.26 
per cent of the cases or scores. This is exactly true in a so-called normal 
distribution only. In most situations where approximately normal dis- 
tributions are the rule, one standard deviation on either side of the mean 
will usually include about two-thirds of the cases. 

Again, as pointed out with Q, the more variable a group, the larger will 
be the standard deviation or distance on either side of the mean required 
to include the middle 68.26 per cent of the scores. This is illustrated in 
Figure 19 on the next page. 


Table XIII 


Calculation of the Mean and Standard Deviation Using Ungrouped Data: 
25 Arithmetic Test Scores and 25 Reading Test Scores 


          ARITHMETIC                            READING 
          Deviations                            Deviations 
          from                                  from 
Scores    Mean (d)     d²             Scores    Mean (d)     d² 

52         19.7      388.09           111        35.7     1274.49 
46         13.7      187.69           102        26.7      712.89 
45         12.7      161.29            94        18.7      349.69 
44         11.7      136.89            92        16.7      278.89 
42          9.7       94.09            91        15.7      246.49 
39          6.7       44.89            87        11.7      136.89 
38          5.7       32.49            86        10.7      114.49 
38          5.7       32.49            81         5.7       32.49 
36          3.7       13.69            80         4.7       22.09 
33           .7         .49            77         1.7        2.89 
33           .7         .49            77         1.7        2.89 
33           .7         .49            76          .7         .49 
32          -.3         .09            75         -.3         .09 
31         -1.3        1.69            73        -2.3        5.29 
30         -2.3        5.29            72        -3.3       10.89 
29         -3.3       10.89            70        -5.3       28.09 
29         -3.3       10.89            69        -6.3       39.69 
28         -4.3       18.49            68        -7.3       53.29 
26         -6.3       39.69            66        -9.3       86.49 
25         -7.3       53.29            65       -10.3      106.09 
23         -9.3       86.49            62       -13.3      176.89 
22        -10.3      106.09            59       -16.3      265.69 
21        -11.3      127.69            56       -19.3      372.49 
18        -14.3      204.49            48       -27.3      745.29 
14        -18.3      334.89            46       -29.3      858.49 

             Σd² = 2093.05                        Σd² = 5923.45 


d = deviations from mean 
M = mean 
N = number of scores or cases 
Σ = sum 
m = scores or measures in group 
σ = standard deviation 

$$M = \frac{\Sigma m}{N} \qquad \sigma = \sqrt{\frac{\Sigma d^2}{N}}$$

Arithmetic: $M = \frac{807}{25} = 32.3$; $\sigma = \sqrt{\frac{2093.05}{25}} = \sqrt{83.72} = 9.1$ 

Reading: $M = \frac{1883}{25} = 75.3$; $\sigma = \sqrt{\frac{5923.45}{25}} = \sqrt{236.94} = 15.4$ 
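
The calculation from ungrouped data can be checked with a few lines of Python. This is a sketch only: it uses the 25 arithmetic scores listed earlier in the appendix, works from the unrounded mean rather than the rounded 32.3, and compares the result with `statistics.pstdev` from the Python standard library, which likewise divides by N.

```python
import math
import statistics

# The 25 arithmetic scores of John's class, highest to lowest.
arithmetic = [52, 46, 45, 44, 42, 39, 38, 38, 36, 33, 33, 33, 32,
              31, 30, 29, 29, 28, 26, 25, 23, 22, 21, 18, 14]

n = len(arithmetic)
mean = sum(arithmetic) / n                      # M = sum of scores / N
deviations = [m - mean for m in arithmetic]     # d = score - mean
sum_d2 = sum(d * d for d in deviations)         # sum of squared deviations
sigma = math.sqrt(sum_d2 / n)                   # root of the mean square

print("M =", round(mean, 1))
print("sum of d squared =", round(sum_d2, 2))
print("sigma =", round(sigma, 2))
print("library value =", round(statistics.pstdev(arithmetic), 2))
```

Because the deviations here are taken from the unrounded mean, the sum of squares differs very slightly from the 2093.05 of the table, which is based on 32.3.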
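
Table XIV 


Calculation of Standard Deviation from a Frequency Distribution 
of 49 Science Test Scores 


c.i.       f      d      fd     fd² 

70-74      1     +6      +6      36 
65-69      2     +5     +10      50 
60-64      2     +4      +8      32 
55-59      3     +3      +9      27 
50-54      3     +2      +6      12 
45-49      5     +1      +5       5 
40-44     11      0       0       0 
35-39      8     -1      -8       8 
30-34      3     -2      -6      12 
25-29      3     -3      -9      27 
20-24      3     -4     -12      48 
15-19      2     -5     -10      50 
10-14      1     -6      -6      36 
5-9        1     -7      -7      49 
0-4        1     -8      -8      64 

       N = 49      Σfd = -22     Σfd² = 456 


$$\sigma = \sqrt{\frac{\Sigma fd^2}{N} - \left(\frac{\Sigma fd}{N}\right)^2} \times c.i.$$

σ = standard deviation 
Σfd² = sum of squared deviations of each score from mean 
Σfd = algebraic sum of deviations of each score from mean 
c.i. = class interval 

$$\sigma = \sqrt{9.3061 - (-.45)^2} \times 5 = \sqrt{9.1036} \times 5 = 15.10$$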

It will be noted that in both a and b one sigma (1.0σ) on either side of 
the mean cuts off 34.13 per cent of the cases, but that the standard devia- 
tion in a is considerably larger than in b due to the greater spread or vari- 
ability of the group represented by curve a. It should be emphasized, 
however, that such comparisons are valid only when the same test is ad- 
ministered to two groups or when the same group is tested twice. 

The standard deviation is one of the most important and valuable 
statistical measures, though it is a little more difficult to calculate than some 
others. It finds many uses, some of which have already been discussed; 
we shall learn about others shortly. Whenever the most stable and 
widely useful measure of variability is desired, the standard deviation is the one 
to employ. It is basic to, or enters into, the calculation of many other 
statistical measures. 


Figure 19 


Comparison of σ's of Two Distributions Differing in Spread 


a.                              b. 

Learning Exercises 

8. Draw a curve similar to the above and show the relationship (approximate) 
of the standard deviation (σ) and the semi-interquartile range (Q). Which is larger? 
Is this always so? Why? 

9. Calculate the means of the scores on arithmetic and reading using the 
frequency distributions made earlier in this chapter. Compare them with the values 
obtained from the same scores ungrouped. 

10. Do the same for the standard deviations. What differences do you find? 
How do you account for them? 


PERCENTILES AND PERCENTILE RANKS 


In Chapter 3 percentile ranks were briefly discussed along with simple 
ranks, primarily to show their superiority to simple ranks in comparing 
standings of individuals when the groups on which the ranks are based 
differ in size. The method for finding the percentile rank of a score, ex- 
plained on page 36, gives only approximate results, and where percentile 
ranks are used extensively more exact methods should be used. Also, since 
tables of test norms are often presented in the form of percentile ranks, we 
need to know how these are obtained. 

By using the formula for the median (which, of course, is the 50th 
percentile), we can calculate any desired percentile, that is, the score below 
which any given percentage of cases lies. All that is necessary is to 
substitute the desired proportion of cases in this formula for the expression 
N/2. We have already done this for Q1 by substituting N/4 (or 25 per cent of 
the cases), which was 12.25, and for Q3 by substituting 3N/4 (or 75 per cent 
of the cases), which was 36.75. (See Table XII, page 398.) If the 10th 
percentile were desired we should use N/10, and so on. 
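
The substitution described here can be written as one general routine. The Python sketch below assumes the frequency distribution of the 49 science scores and replaces N/2 in the median formula by the desired percentage of N; the function name `percentile_score` is illustrative.

```python
WIDTH = 5
lower_limits = list(range(0, 75, WIDTH))                  # 0, 5, ..., 70
freqs = [1, 1, 1, 2, 3, 3, 3, 8, 11, 5, 3, 3, 2, 2, 1]    # lowest interval first


def percentile_score(percent, lowers, freqs, width):
    """Score below which the given percentage of cases lies."""
    cases = percent / 100 * sum(freqs)        # replaces N/2 in the median formula
    below = 0                                 # F: cases below the actual l.l.
    for lower, f in zip(lowers, freqs):
        if f > 0 and below + f >= cases:
            return (lower - 0.5) + (cases - below) / f * width
        below += f
    raise ValueError("percent must not exceed 100")


for p in (10, 25, 50, 75, 90):
    print(f"{p}th percentile: {percentile_score(p, lower_limits, freqs, WIDTH):.2f}")
```

With 25, 50, and 75 per cent the routine returns the values already obtained for Q1, the median, and Q3.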


The simplest and most practical method of arriving at percentiles or 
percentile ranks for any given distribution is to construct a cumulative 
frequency, or ogive, curve. From such a curve percentiles or percentile 
ranks can be read very easily. Such curves also have other uses which 
will be discussed later. 

There are two methods for constructing such a curve. They are basically 
the same, the choice of which to use being a matter of preference. The 
necessary calculations for each method are shown in Table XV. 


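Table XV 

Two Methods of Calculating Values Needed to Construct 
a Percentile Graph or Ogive Curve 


METHOD I                                  METHOD II 

Percentile   % of N    Percentile         c.i.      f    cum. f   Percentile 
                       Score                                      Rank 

100%         49.00     74.50              70-74     1     49      100.00 
95%          46.55     65.88              65-69     2     48       97.92 
90%          44.10     59.75              60-64     2     46       93.84 
80%          39.20     51.50              55-59     3     44       89.76 
70%          34.30     45.80              50-54     3     41       83.64 
60%          29.40     42.86              45-49     5     38       77.52 
50%          24.50     40.64              40-44    11     33       67.32 
40%          19.60     38.00              35-39     8     22       44.88 
30%          14.70     34.94              30-34     3     14       28.56 
20%           9.80     27.50              25-29     3     11       22.44 
10%           4.90     19.25              20-24     3      8       16.32 
5%            2.45     11.75              15-19     2      5       10.20 
0%            0.00      -.50              10-14     1      3        6.12 
                                          5-9       1      2        4.08 
                                          0-4       1      1        2.04 


SAMPLE CALCULATIONS 

Method I 

$$\text{5th percentile} = l.l. + \left(\frac{5\%\ \text{of}\ N - F}{f_m}\right) \times c.i. = 9.5 + \left(\frac{2.45 - 2}{1}\right) \times 5 = 11.75$$

Method II 

Percentile rank of lower limit of interval 5-9: 

$$\frac{1}{49} \times 100 = 2.04$$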

Under Method I scores have been calculated corresponding to arbitrarily 
chosen percentiles or percents of the total number of cases, just as has been 
shown previously for the median and Q1 and Q3. These points are then 
plotted as shown in Figure 20. They are marked with a circle to 
distinguish them from the points calculated and plotted by the second method, 
those points being marked with an X. In the latter we have calculated 
the percentile ranks corresponding to certain frequencies, using the total 
number of scores in the distribution as our base. Method I goes from 
percentages to scores, whereas Method II goes from scores to percentile 
ranks. 

It will be noted that the two methods give results that agree exactly,
except possibly at the two extremes of the curve. At these places it is
permissible to end the two curves at the upper limit of the highest interval
and at the lower limit of the lowest interval, calling these points the
one hundredth and the zero percentiles, respectively. When this is done
the curves obtained by the two sets of calculations representing the two
methods will coincide.


Figure 20

Percentile Graph Based on 49 Science Test Scores
○ Designates points determined by Method I.
X Designates points determined by Method II.

With such a curve one may quickly determine the percentile rank of 
any score by locating the score on the vertical scale, finding the point 
where the curve crosses the line corresponding to that score, and dropping 
a perpendicular (visually) to the base line where the corresponding per- 
centile rank may be read. The process is made clear by the dotted lines 
which have been drawn in the diagram. In the use of such a graph, lines 
such as these are not drawn, of course, but the answers are determined by 
inspection. 

In the curve shown in Figure 20 a score of 37 is found to have a percentile
rank of 35; similarly, a score of 57 has a percentile rank of 86. By
definition, 35 per cent of the scores lie below 37, and 86 per cent lie below
57. By use of such a curve we may find the percentile rank of any score in
the series. Percentile graphs are useful in many other ways, a few of which 
are mentioned below. 


1. Besides obtaining percentile ranks visually, we can easily determine 
what score corresponds to any given percentile; thus we may estimate 
the 25th ($Q_1$), 50th (median), and 75th ($Q_3$), as well as any other score
corresponding to any percentile from zero to one hundred, directly from the
curve. 

2. We can determine the percentage of scores which lie between certain 
limits. Likewise, it is easy to estimate the range of scores in the upper 10 
per cent of the group or the range of the middle 20 per cent. 

3. It is possible to construct several percentile curves on the same 
graph, representing distributions for several different groups, such as 
successive grades on the same test or the same group on several different 
tests. With these curves one may compare medians, quartiles, or any other 
corresponding points on the curves, determine percentage of over-lapping, 
and make many other useful comparisons. 

In tables of norms, percentile ranks are usually given numerically for 
every possible score, but these values are most conveniently determined 
by first constructing a curve based on the cumulative frequencies as we 
have done here. 
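Reading a percentile rank from the curve amounts to straight-line interpolation between the plotted cumulative frequencies. A minimal sketch follows, assuming a hypothetical distribution (`dist`) and a hypothetical function name (`percentile_rank`); it is not the science-test distribution of Figure 20.

```python
def percentile_rank(intervals, score):
    """Per cent of cases lying below `score`, interpolating within an interval.

    intervals: list of (lower_limit, width, frequency), lowest interval first.
    """
    n = sum(f for _, _, f in intervals)
    cum_below = 0.0
    for lower, width, f in intervals:
        upper = lower + width
        if score >= upper:
            cum_below += f                               # whole interval lies below the score
        elif score > lower:
            cum_below += f * (score - lower) / width     # part of the interval lies below
            break
        else:
            break
    return 100.0 * cum_below / n


# Hypothetical distribution of 20 scores in intervals 10-14, 15-19, ..., 30-34.
dist = [(9.5, 5, 2), (14.5, 5, 4), (19.5, 5, 8), (24.5, 5, 4), (29.5, 5, 2)]

# A table of norms: the percentile rank of every whole-number score.
norms = {s: round(percentile_rank(dist, s), 1) for s in range(10, 35)}
print(percentile_rank(dist, 22.0))
```

Building the dictionary `norms` corresponds to the practice of giving a percentile rank for every possible score.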

11. Construct a percentile curve for the 25 arithmetic scores and one for the 25
reading scores using Method I for arithmetic and Method II for reading. Are these 
curves like the sample in Figure 20? If not, how do you explain the differences? 


MEASURES OF CORRELATION OR RELATIONSHIP 


In Chapter 3 graphic illustrations of correlation were presented in the 
form of scatter diagrams or two-way tables. These provided pictorial or 
graphic representation of the extent of agreement between two variables
and also showed whether the relationship was positive or negative. In 
each case a coefficient of correlation was also given; this may be defined 
as a quantitative measure of the degree and direction of relationship exist- 
ing between two (or more) variables. It is very important in that it forms 
the basis for gauging the efficiency of prediction. 

As has already been explained, much of the work in educational and 
psychological measurement has prediction as one of its chief and most im- 
portant aims. Whether we are serving as teacher, counselor, or clinician, 
one of the prime purposes of measurement is to enable us to predict. When 
we give a test, whether of intelligence, aptitude, personality, or achieve- 
ment, we are often concerned with getting a more accurate idea of the 
probable accomplishment, success, behavior, or adjustment of the indi- 
vidual pupil. In evaluating a measuring instrument or device with regard 
to its forecasting efficiency, the coefficient of correlation is indispensable. 


Rank Difference Correlation 

Although there are several methods of calculating the coefficient of cor- 
relation, only two will be discussed here. The first and simplest of these 
is based on ranks. Briefly, the theory underlying it is that if two sets of 
scores are obtained on the same population, and if each individual is 
ranked on both, the size of the differences between the ranks gives a 
measure of the extent of agreement between the two tests. To illustrate, 
let us assume three pupils, A, B, and C, have taken two tests, one in geog- 
raphy and one in intelligence. Their scores are: 


A B C 
Geography 25 40 30 
Intelligence 90 100 95 


If we rank them, they take the following order: 


                             A     B     C
Rank in Geography Test       3     1     2
Rank in Intelligence Test    3     1     2


The ranks on the two tests agree perfectly, as may be shown by taking the
differences between the two sets of ranks, thus:


                             A     B     C
Rank in Geography Test       3     1     2
Rank in Intelligence Test    3     1     2
Differences between Ranks    0     0     0

Since the agreement is perfect, the correlation will be perfect.
Now let us suppose the scores, ranks, and differences are as follows: 


                             A     B     C
Score in Geography Test      25    30    40
Score in Intelligence Test   90    110   100


If these are ranked they show the following order:


                             A     B     C
Rank in Geography Test       3     2     1
Rank in Intelligence Test    3     1     2
Differences between Ranks    0     1     1


Here we have something less than perfect agreement and the differences 
between ranks are greater than they were in the first case. 

One more illustration will help to clarify this principle. Let us assume 
the following: 


                             A     B     C
Score in Geography Test      40    30    20
Score in Intelligence Test   90    100   110


Rank in Geography Test       1     2     3
Rank in Intelligence Test    3     2     1
Differences between Ranks    2     0     2


In this third case we have a complete reversal of ranks between the two 
tests, and thus the differences are at a maximum total. This illustration, 
though greatly simplified, shows how agreement (or lack of it) between 
ranks gives a measure of the extent of correlation. 

An eminent English statistician, Charles Spearman, worked out a 
method of determining the extent of correlation based on this principle. 
It is known as the Spearman Rank Difference Method. Using the two sets
of scores in arithmetic and reading for John’s class (see pages 35 and 42),
we shall work out the correlation by this method. The entire procedure,
based on the formula $\rho$ (rho) $= 1 - \frac{6\sum D^2}{N(N^2 - 1)}$, is
shown in Table XVI.

The results of these calculations, a correlation coefficient of .65, show 
that there is a substantial degree of relationship between scores on the 
arithmetic test and scores on the reading test. We can say, therefore, that 
there is a marked tendency for pupils who do well on one to do well on 
the other, and vice versa. The correlation is not perfect by any means, 
and there are individuals who constitute important exceptions to the 
general trend, for example, in such instances as D, R, T, and V. 
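The rank difference procedure can be sketched as follows. The three small checks use the geography and intelligence scores of pupils A, B, and C given above; the function names are hypothetical, and the rule of giving tied scores the average of the ranks they occupy is an assumption (the fractional sum of squared differences, 920.50, in the class calculation suggests that tied ranks were averaged).

```python
def average_ranks(scores):
    """Rank scores with 1 = highest; tied scores share the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of scores tied with the score at position i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_difference_rho(x, y):
    """rho = 1 - 6 * sum(D^2) / (N * (N^2 - 1)), D = difference between ranks."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


perfect = rank_difference_rho([25, 40, 30], [90, 100, 95])     # first illustration
partial = rank_difference_rho([25, 30, 40], [90, 110, 100])    # second illustration
reversed_ = rank_difference_rho([40, 30, 20], [90, 100, 110])  # third illustration

# The arithmetic and reading result, from the sum of squared differences 920.50:
class_rho = 1 - 6 * 920.50 / (25 * (25 ** 2 - 1))
print(perfect, partial, reversed_, round(class_rho, 2))
```

The three illustrations thus run from perfect agreement through partial agreement to complete reversal.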

Table XVI
Calculation of Rank Difference ($\rho$) Correlation Coefficient:
Scores of 25 Pupils on Arithmetic Test and Reading Test


               Scores                    Ranks
Pupil   Arithmetic   Reading     Arithmetic   Reading     D     D²


44 
21 
14 
18 
46 
45 
52 
30 
39 
36 
31 
22 
23 
38 
33 
33 


Pupils: A B C D E F G H I J K L M N O P Q R S T U V W X Y


$$\rho = \text{rank difference correlation} = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$$

$$= 1 - \frac{6 \times 920.50}{25(625 - 1)}$$

$$= 1 - \frac{5523}{15600}$$

$$= 1 - .35 = .65$$

N = number of cases
D = differences between ranks



Product-Moment Correlation 

One of the disadvantages of the Rank Difference Method is that it is 
practical only with small groups. If we have large numbers of cases the 
numbers denoting ranks and the possible rank differences become large.
When these are squared they become too cumbersome to deal with conveniently.
Aside from the matter of convenience, there are other reasons
why statistical workers generally prefer another method of correlation
known as the Pearson Product-Moment Correlation. Let us suppose we 
have measured the heights and weights of some small children and we wish 
to determine whether there is a correlation between these two variables. 
In order to simplify the explanation we shall use only five cases which we 
shall call J, K, L, M, and N. 


          Height   Weight        Deviations from Mean
            X        Y         x      y      x²     y²     xy
J           31       17        1     -2      1      4     -2
K           27       18       -3     -1      9      1      3
L           29       16       -1     -3      1      9      3
M           31       22        1      3      1      9      3
N           32       22        2      3      4      9      6
Sum                                          16     32     13

Mean        30       19


A formula (there are many variations) for the Product-Moment Correlation
is:

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \times \sum y^2}}$$


Substituting the values obtained for the five cases above gives us:

$$r = \frac{13}{\sqrt{16 \times 32}} = \frac{13}{22.63} = .57+$$



The Product-Moment coefficient of correlation shows the extent to which
variations of individuals from the respective means of the distribution of
two traits agree in direction and relative size. For example, M and N are
both above the means in height and weight; K and L are below the means
in both. Their xy products are all positive, yielding a positive value of r.
However, J is above the mean in height but below the mean in weight. The
xy product is negative, which reduces the $\sum xy$ and the size of r. If the sum
of the negative xy's equals the sum of the positive xy's, the $\sum xy$ is zero and
r is zero, showing that there is no consistent tendency for variations of
individuals to agree. When the sum of the negative xy's exceeds the sum of
the positive xy's the correlation is negative, showing that variations of
individuals on the two traits tend to go in opposite directions though similar
in relative amount. Various types and degrees of relationships were shown
graphically in Figures 1, 2 and 3, Chapter 3.
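The height and weight example can be checked with a short Python sketch. The function name `product_moment_r` is hypothetical; the data are the five cases J, K, L, M, and N from the table above.

```python
from math import sqrt


def product_moment_r(xs, ys):
    """r = sum(xy) / sqrt(sum(x^2) * sum(y^2)), with x and y as deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [v - mean_x for v in xs]          # the x column of the table
    dev_y = [v - mean_y for v in ys]          # the y column of the table
    sum_xy = sum(a * b for a, b in zip(dev_x, dev_y))
    sum_x2 = sum(a * a for a in dev_x)
    sum_y2 = sum(b * b for b in dev_y)
    return sum_xy / sqrt(sum_x2 * sum_y2)


heights = [31, 27, 29, 31, 32]   # J, K, L, M, N
weights = [17, 18, 16, 22, 22]

r = product_moment_r(heights, weights)
print(round(r, 2))
```

Reversing the order of one list (for example, `weights[::-1]`) is a quick way to see how the sign and size of r respond when the agreement between the two sets of deviations changes.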

12. Make a table similar to the one above, using five cases that you think would
give a negative correlation. Prove it.

13. Using the twenty-five scores in arithmetic and reading (Table XVI), calculate
the Product-Moment coefficient of correlation. (Suggestion: To reduce the labor
of calculation, use 32 as the mean of the arithmetic scores and 75 as the mean of
the reading scores.) Compare your answer with that obtained by the Rank
Difference Method.


A variation of the formula for r which was used above is:

$$r = \frac{\sum xy}{N \sigma_x \sigma_y}$$


This is useful if the standard deviations of each variable (in this case,
arithmetic and reading test scores) have already been calculated. All that
needs to be done in addition, then, is to calculate the xy products, substitute 
the different values in the formula, and solve for r. 
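The two forms of the formula agree whenever each standard deviation is computed from deviations about the mean with N in the denominator, which is assumed here. A short derivation, followed by a check on the five height and weight cases:

```latex
\sigma_x = \sqrt{\frac{\sum x^2}{N}}, \qquad
\sigma_y = \sqrt{\frac{\sum y^2}{N}}
\quad\Longrightarrow\quad
N \sigma_x \sigma_y
  = N \sqrt{\frac{\sum x^2}{N}} \sqrt{\frac{\sum y^2}{N}}
  = \sqrt{\sum x^2 \times \sum y^2}

% Check with the five cases (N = 5, \sum x^2 = 16, \sum y^2 = 32, \sum xy = 13):
N \sigma_x \sigma_y = 5 \sqrt{\tfrac{16}{5}} \sqrt{\tfrac{32}{5}} = \sqrt{512} \approx 22.63,
\qquad r = \frac{13}{22.63} \approx .57
```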

The calculation of r from grouped data is more complicated than it is as 
shown here with use of actual scores. For further information on this 
procedure textbooks in statistical methods may be consulted. A number
of so-called “Correlation Charts” have been devised to make this task 
simpler and more mechanical. These can be obtained from publishers of 
standardized tests, and they are often very useful, especially if one has a 
large number of correlations to do. 


Uses of Correlation 

Modern work in the field of measurement would be impossible without 
the use of correlation methods. Although it is not appropriate here to 
go into details concerning the uses of correlation in measurement, a few 
examples may be given. 

In Chapter 4 the criteria of a good measuring instrument were discussed. 
It was stated that of all such criteria, validity is the most important, and
different approaches to validating a test were described. One of these 
was empirical or statistical validity. The extent to which a test (let us say 
of intelligence) correlates with a criterion, that is, with some accepted 
measure of intelligence, is a measure of its statistical validity. It is obvious 
that if the criterion is a valid one and the test under scrutiny does not cor- 
relate with the criterion to any noticeable extent, it cannot be regarded as 
having statistical validity. A test supposed to measure intelligence of 
ten-year-old children, scores on which show the following correlations with
other criteria of intelligence, would hardly be said to have statistical 
validity: 


                             With school marks      With Stanford-
Correlation of supposed      over several years     Binet mental ages
test of intelligence AT .28 


Such a test would be open to strong suspicion as a measure of intelligence, 
though it might conceivably prove to have validity as a measure of some- 
thing else. 

In situations where statistical measures of validity are required, the
coefficient of correlation is the measure most frequently used. The cor- 
relation between scores on the measure or test whose validity is to be de- 
termined, and some established or generally accepted measure of the same 
quality or trait which the new test purports to measure, is a standard 
statistical criterion of validity. 

For most standardized tests of intelligence such correlations are usually 
considered acceptable if they are in the neighborhood of .40. Higher cor- 
relations would be desirable, of course, but in practice they generally do not 
exceed .50. Correlations of this size are not large, and one might question 
the value of a test which correlates only .40 to .50 with a criterion. A group 
intelligence test usually yields correlations of this order with school achieve- 
ment as represented by teachers’ marks. If such marks were measures 
of achievement and nothing else, if school achievement were dependent 
upon intelligence alone, and if marks were perfectly dependable (reliable) 
measures, the correlation would certainly be much higher. Since these 
conditions are not met fully, and generally obtain to only a limited extent, 
correlations between a standard group intelligence test and school marks 
are usually not very high. Yet such correlations may be taken as some 
indication of validity when the factors affecting the relationship are taken 
into account. 


Reliability 

Another important application of correlation techniques is in determin- 
ing the reliability of tests. This has been defined as the consistency with 
which a test measures whatever it does measure. There are three com- 
monly used methods of determining reliability or consistency of measure- 
ment. The first is to give the same test twice to the same group and 
calculate the correlation between the two sets of scores. The second is to 
give two equivalent forms of a test to the same group. The third method 
of determining the reliability of a test is to administer it once only, score 
it by the split-half method, correlate the half scores, and apply the
Spearman-Brown formula. These methods have been explained and illustrated
in Chapter 3, and are mentioned again here only as illustrations of the use 
of correlation techniques. 
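For the third method, the correlation between the two half scores understates the reliability of the whole test, and the Spearman-Brown formula corrects for this. The sketch below uses the usual form for a test of doubled length; the half-test correlation of .80 is a hypothetical value chosen only for illustration.

```python
def spearman_brown(r_half):
    """Estimated reliability of the full test from the correlation between its two halves."""
    return 2 * r_half / (1 + r_half)


r_half = 0.80                      # hypothetical correlation between odd and even half scores
r_full = spearman_brown(r_half)    # estimated reliability of the whole test
print(round(r_full, 2))
```

The corrected value is always at least as large as the half-test correlation when that correlation lies between 0 and 1.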

Reliability coefficients of standardized tests may be expected to reach 
or exceed .90, and some published ones exceed .95. This presents a marked 
contrast with the size of the usual validity coefficients and illustrates the 
inadequacy of interpreting coefficients of correlation on the basis of magni- 
tude alone. Whereas a validity coefficient of .50 is usually quite acceptable, 
a reliability of .50 would be considered unsatisfactory for almost any test. 


Prediction 

It was stated earlier that correlation coefficients are useful in determining
the predictive value of tests. Indeed, it has been suggested that one of
the basic aims of testing is prediction. Let us see how coefficients of
correlation serve these purposes.

The rank difference correlation between arithmetic and reading scores in
John's class was found to be .65. Now, suppose we ask the question: How 
can we predict a pupil's score on the arithmetic test, knowing his score on 
the reading test? Knowing a pupil's score on the arithmetic test, how ac- 
curately can we predict his reading score? Answering this question in- 
volves calculations which are beyond the scope of this book, but one of 
the end products is a formula known as the Standard Error of Estimate,
which is $\sigma_{est\,y} = \sigma_y \sqrt{1 - r^2}$, where $\sigma_{est\,y}$
is the standard error of prediction of a reading test score, $\sigma_y$ is the
standard deviation of the reading test scores, and r is the correlation
between arithmetic and reading scores.
Substituting our calculated values (see Table XIII) in this formula we have:


$$\begin{aligned}
\sigma_{est\,y} &= 15.4 \sqrt{1 - (.65)^2} \\
 &= 15.4 \sqrt{1 - .4225} \\
 &= 15.4 \sqrt{.5775} \\
 &= 15.4 \times .76 \\
 &= 11.7
\end{aligned}$$


This tells us that, knowing the score in arithmetic made by John or one 
of his classmates, we can predict his score in reading with an estimated 
standard error of about 12 points, and that the chances are 68.26 out of 
100 that our prediction will not be in error by more than 12 points either 
way. In other words, if the most probable score on reading is 75 for a
person scoring 32 on arithmetic, the chances are about 2 to 1 that his actual
score in reading will fall between 75 - 12, or 63, and 75 + 12, or 87.
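The substitution can be verified with a brief Python sketch. The function name is hypothetical; the standard deviation of 15.4, the correlation of .65, and the predicted reading score of 75 are the figures used in the text.

```python
from math import sqrt


def standard_error_of_estimate(sd_y, r):
    """sigma_est_y = sigma_y * sqrt(1 - r^2)."""
    return sd_y * sqrt(1 - r ** 2)


se_est = standard_error_of_estimate(15.4, 0.65)

predicted_reading = 75
low = predicted_reading - se_est     # lower end of the band of one standard error
high = predicted_reading + se_est    # upper end of the band of one standard error
print(round(se_est, 1), round(low), round(high))
```

Note how slowly the error shrinks as r rises: even with r = .65 the standard error is still about three fourths of the standard deviation of the reading scores.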

Similarly, we may wish to know what the standard error of an obtained
score is when we have given a test of known reliability. To put the prob- 
lem in another way, what is the probability that a pupil’s true score on a 
test does not differ significantly from the score which he actually obtains? 
This is another way of asking how reliable a single test score is. 

Here again, we have a formula which is based on correlation, this time 
on the reliability coefficient. It is $\sigma_{meas.} = \sigma \sqrt{1 - r}$,
where $\sigma_{meas.}$ is the standard error of measurement, $\sigma$ is the
standard deviation of the test, and r is the reliability coefficient of the test.

To illustrate the use of this formula, let us assume that we have deter- 
mined the reliability of the arithmetic test used in John’s class to be .90. 
We know that the standard deviation is 9.2 (Table XIII). Substituting, 
we have: 


$$\begin{aligned}
\sigma_{meas.} &= 9.2 \sqrt{1 - .90} \\
 &= 9.2 \sqrt{.10} \\
 &= 9.2 \times .32 \\
 &= 2.9
\end{aligned}$$



This result indicates that the chances are about two to one that the 
obtained score of any pupil on the arithmetic test will not vary from his 
true score (whatever it may be) by more than three (2.9) points. More 
specifically, the chances are two to one that John's true score on the
arithmetic test lies somewhere between 33 and 39 (his obtained score of 36 ± 3).
Furthermore, we can say with much greater assurance (chances about 19 to 
1) that his true score lies between 30 and 42; that is, it does not differ from 
his obtained score by more than six points. 
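A short sketch of this calculation follows, together with the normal-curve probabilities that lie behind the phrases "two to one" and "19 to 1". The function names are hypothetical. The probability of falling within k standard errors is taken from the normal curve by means of the error function; the use of exactly 1 and 2 standard errors is an assumption, the text giving only rounded odds.

```python
from math import erf, sqrt


def standard_error_of_measurement(sd, reliability):
    """sigma_meas = sigma * sqrt(1 - r), r being the reliability coefficient."""
    return sd * sqrt(1 - reliability)


def prob_within(k):
    """Probability that a normally distributed error lies within k standard errors."""
    return erf(k / sqrt(2))


sem = standard_error_of_measurement(9.2, 0.90)

obtained = 36                                     # John's obtained arithmetic score
one_se_band = (obtained - sem, obtained + sem)    # roughly 33 to 39
two_se_band = (obtained - 2 * sem, obtained + 2 * sem)  # roughly 30 to 42
print(round(sem, 1), round(prob_within(1), 4), round(prob_within(2), 4))
```

A probability of about .68 corresponds to odds of roughly two to one, and one of about .95 to odds of roughly 19 or 20 to one.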

14. Assume the following data:


(a) Correlation between I.Q. and marks in algebra equals .64.
(b) Reliability of test of intelligence equals .91; of teacher's marks in algebra
equals .51.
(c) Standard deviation of scores on intelligence test equals 16; of teacher's 
marks on basis of 4 for A, 3 for B, 2 for C, 1 for D and 0 for E, equals 1.0. 
Using these data, calculate the standard error of measurement of the intelligence 
test and of the teacher's marks in algebra. 
15. Calculate the standard error of estimate for scores on the algebra test, 
knowing the pupil’s mark. For his mark, knowing his score on the algebra test. 

THE NORMAL PROBABILITY CURVE


To introduce the discussion of the normal curve, two methods commonly 
used to depict frequency distributions in graphic form are shown in Figure 
21. Both are based on the distribution of arithmetic scores, Table VIII, 
page 388. 

Figure 21 


Graphs of Frequency Distribution of 25 Arithmetic Test Scores 

The histogram and the frequency polygon are based on the same data,
but the method of construction differs, as does the appearance of the two
graphs. In both, frequencies are represented on the vertical axis, class
intervals on the horizontal axis. Thus, the height at any point represents
the number of scores or cases in the interval directly below that point. In 
the histogram, points are located at the correct height at the beginning 
and end of each interval. These are connected by a horizontal line, and 
by vertical lines to the adjacent points in the next higher and lower inter- 
vals. The graph ends on the base line at the lower limit of the lowest in- 
terval and at the upper limit of the highest interval of the distribution. 

In the frequency polygon, points representing the frequencies of each
class interval are located directly above the middle of the respective class
intervals. These points are connected with straight lines. The graph
ends at the base line, as in the case of the histogram, but there is one differ- 
ence in this respect. It is customary to end the frequency polygon at the 
mid-point of the interval just below the lowest one in the frequency dis- 
tribution that contains any cases, and at the mid-point of the next highest 
interval above the highest one in the frequency distribution. 

These two graphs have certain important features in common which 
concern us in considering the normal curve. First, they have a form which 
is generally referred to as humped or bell-shaped. This results from the 
fact that frequencies are much smaller at the extremes than at the middle.
That is, the number of cases increases more or less steadily as we go to- 
ward the middle or average from very high or very low scores. Second, 
the curves or graphs are continuous. There are no gaps, or class intervals 
with zero frequencies. These two features or characteristics are common 
to all so-called normal curves.²

The normal curve is a limiting curve which is approached by many dis- 
tributions when a large number of measurements is made, or, as we say, 
when there is a large number of cases. It is necessary to assume, furthermore,
that these measurements or cases are taken at random, or that there
is no bias or systematic error. For example, if it were desired to take an
unbiased and representative sample of students on a given college or
university campus, it would be necessary to plan the sampling procedure in
such a way that every student would have an equal chance of being chosen.
If these conditions were met the sample would be an unbiased and
representative one.

One of the usual ways of illustrating the normal curve is by tossing coins 
or dice. If we represent “heads” by H and "tails" by T, the expression 
H + T represents the probabilities for any toss of one coin, namely equal 
probabilities of a head or a tail. If we toss the coin 100 times, the results 
will approximate 50 heads and 50 tails. If we toss two coins, the possibilities 
are two heads, head and tail, tail and head, two tails, or $H^2 + 2HT + T^2$.

If we toss two coins 100 times we would (theoretically) get $25H^2 + 50HT
+ 25T^2$, or both coins heads 25 times, one head and one tail 50 times, and
both coins tails 25 times. Similarly, we can predict the theoretical fre- 
quency with which each possible combination of any number of coins 
tossed simultaneously any given number of times will occur. 

In tossing coins, if there is an equal chance for each coin to fall head or 
tail each time, every possible combination can occur, but the probabilities 
of getting ten heads or ten tails when we toss ten coins are less than those 
for getting other combinations. The most probable combination is five 
heads and five tails, since each coin has an equal chance of falling heads or 
tails. By expanding $(H + T)^{10}$ we get the probabilities of each possible
combination occurring if ten coins were tossed an infinite number of times.
The expression becomes $H^{10} + 10H^9T + 45H^8T^2 + 120H^7T^3 + 210H^6T^4 +
252H^5T^5 + 210H^4T^6 + 120H^3T^7 + 45H^2T^8 + 10HT^9 + T^{10}$.

This means that the chances of getting five heads and five tails in tossing
ten coins are 252 in 1,024. The probabilities of getting ten heads or ten
tails are, respectively, one in 1,024. Where ten coins have thus been tossed,
it has been found that the frequencies with which possible combinations
do occur approach the theoretical values as limits. For example, Figure
22 shows the results of tossing ten pennies 1,000 times, based upon an
actual experiment.


² The term normal has a mathematical connotation which has no connection with
normal and abnormal as used in psychology or education. The equation for the normal ...


Figure 22 
Frequency Distribution Based on Tossing Ten 
Pennies 1,000 Times 
(From Daniel Starch, Educational Psychology; New York:
The Macmillan Company, 1919. By permission of the
author.)


Although the number of tosses is 1,000 instead of 1,024, it is evident 
that the frequency with which each possible combination actually occurred 
closely approximated the theoretical values. It is presumed that under 
ideal conditions, that is, an infinite number of tosses with each toss exactly 
like every other, and each coin free to fall heads or tails as chance dictates, 
the actual frequencies would coincide with the theoretical. 
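Both the theoretical frequencies and an experiment of this kind can be imitated in Python. The sketch below is an illustration only: it is a computer simulation with a hypothetical function name and an arbitrary seed, not a reproduction of the experiment shown in Figure 22.

```python
import random
from math import comb

N_COINS = 10

# Coefficients of the expansion of (H + T)^10: ways of getting k heads out of ten coins.
coefficients = [comb(N_COINS, k) for k in range(N_COINS + 1)]
total_ways = 2 ** N_COINS            # 1,024 equally likely outcomes


def toss_experiment(n_tosses, seed=0):
    """Toss ten fair coins n_tosses times; count how often each number of heads occurs."""
    rng = random.Random(seed)        # the seed is arbitrary and only makes the run repeatable
    counts = [0] * (N_COINS + 1)
    for _ in range(n_tosses):
        heads = sum(rng.random() < 0.5 for _ in range(N_COINS))
        counts[heads] += 1
    return counts


observed = toss_experiment(1000)
expected = [1000 * c / total_ways for c in coefficients]

for k in range(N_COINS + 1):
    print(f"{k:2d} heads: theoretical {expected[k]:6.1f}   observed {observed[k]:4d}")
```

Increasing the number of tosses brings the observed proportions closer to the theoretical ones, which is the sense in which the theoretical values are approached as limits.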

Another situation in which the typical bell-shaped distribution curve 
occurs is in measurement of natural phenomena. Thousands of such meas- 
urements have been made of barometric and temperature readings over a 
long period at a given locality; of height, weight, and other bodily measure- 
ments of humans of the same sex and age; of the distribution of errors of 
measurement which are due to chance; and of measures of ability and 
achievement, particularly when these are objective and based on large 
numbers of cases. In Figure 23 is shown such a distribution curve for 
stature of men. 

In Chapter 10 the distribution of I.Q.’s based on the composite of Forms
L and M of the Revised Stanford-Binet is reproduced in Figure 7.

Figure 24 shows the distribution of scores on an objective examination 
in educational measurement. 

Figure 23


Frequency Distribution of Stature for 8,585 Adult Males 
Born in the British Isles 

Figure 24


Frequency Distribution of Scores by 138 College Students on 
an Objective Final Examination in Educational Measurement 

All these distribution curves approximate, more or less closely, the
theoretical frequency curve. The larger the number of measurements, 
the more symmetrical the curves become. 

The significance of these results for educational and psychological meas- 
urement is great. In the first place, to the extent that human abilities 
tend to be distributed normally, we may expect to find relatively small 
proportions in the total population that are very gifted or extremely 
lacking in ability, and conversely, we may expect the great majority to 
cluster around the average. 

In the second place, we may expect to have few, if any, gaps or breaks in 
the distributions of such measurements. We do not find classes or types in 
nature, but rather, all gradations from the lowest to the highest. This has 
particular significance in view of the widespread tendency to classify people
into types. We frequently encounter systems of classifying individuals
into personality types, or physical types, or on some other basis. It is well 
to remember that human beings do not naturally fall into types or groups 
on the basis of traits such as intelligence, personality, or achievement. 
When we do group them it is generally for administrative reasons or reasons 
of convenience which, while important and often necessary, should not 
blind us to the fact of continuity in the distribution of human traits.

Finally, the concept of the normal distribution is very important to 
educational statistics, and therefore to educational measurements. Most 
statistical measures — the standard deviation, for example — are calcu- 
lated by methods which assume a normal distribution. More particularly, 
techniques for estimating the accuracy of measurements rest upon the concept
of the normal probability curve. As we have emphasized, the usefulness
of tests and evaluative techniques depends in part upon their value for
predictive purposes. Such instruments as intelligence, aptitude, and prognostic
tests are essentially tests for predictive purposes. Prediction is based
squarely on the concept of probability. Given a certain score on a certain 
test, what will be the probable score if the test is given again? Or, what 
is the probable score on another test of the same or similar abilities, traits, 
or potentialities? Or, what are the probabilities that an individual who 
makes a certain score on a certain test will be successful in a chosen pro- 
fession? Again, what are the chances (probabilities) that the true score of 
an individual on an examination differs from the score he actually made, 
and by how much? The answers to these and many similar questions 
depend upon the normal probability concept. 

It has been emphasized and should always be remembered that the concept
of the normal distribution has definite limitations as well as great
usefulness in educational measurement. A common error is to apply it in
circumstances where it is inappropriate. This was pointed out in Chapter
14 in connection with marking. However, recent advances in statistical
methodology make possible certain corrections for various conditions 
resulting in departures from the normal distribution, and these widen the 
applicability of methods based on the assumption of normality. On the 
whole, the idea of the normal distribution is a very useful and even
indispensable one in educational and psychological measurement and research.
It is difficult to see how the present levels of development in these fields 
could have been attained without it. 

16. Write a paper of about 500 words explaining how you would use the statistical
ideas and methods presented in Chapter 3 and Appendix A in your work as a
teacher or counselor. Give illustrations and examples for each use. 

A Selective List of Test Publishers
in the United States


This list includes those publishers or other organizations whose tests are
referred to in this book. The publishers whose names are preceded by an 
asterisk (*) issue catalogs devoted entirely or in large part to tests. 


* ACORN PUBLISHING COMPANY, Rockville Centre, New York

* BUREAU OF EDUCATIONAL MEASUREMENTS, Kansas State Teachers College of
Emporia, Emporia, Kansas

* BUREAU OF EDUCATIONAL RESEARCH AND SERVICE, State University of
Iowa, Iowa City, Iowa

* BUREAU OF PUBLICATIONS, Teachers College, Columbia University, New
York 27, New York

* CALIFORNIA TEST BUREAU, 5916 Hollywood Boulevard, Los Angeles 28,
California

* CENTER FOR PSYCHOLOGICAL SERVICE, George Washington University,
Washington 6, D.C.

* COOPERATIVE TEST DIVISION, EDUCATIONAL TESTING SERVICE, 20 Nassau
Street, Princeton, New Jersey

EDUCATIONAL RECORDS BUREAU, 21 Audubon Avenue, New York 32,
New York

* EDUCATIONAL TEST BUREAU, 720 Washington Avenue, S.E., Minneapolis
14, Minnesota

* C. A. GREGORY COMPANY, 345 Calhoun Street, Cincinnati 19, Ohio

* HOUGHTON MIFFLIN COMPANY, 2 Park Street, Boston 7, Massachusetts

INTERNATIONAL TEXTBOOK COMPANY, Scranton, Pennsylvania

McKNIGHT AND McKNIGHT PUBLISHING COMPANY, Bloomington, Illinois

* OHIO SCHOLARSHIP TESTS, Ohio State Department of Education,
Columbus, Ohio

PERSONNEL PRESS, INC., 188 Nassau Street, Princeton, New Jersey

PERSONNEL RESEARCH INSTITUTE, Western Reserve University,
Cleveland, Ohio

* PSYCHOLOGICAL CORPORATION, 522 Fifth Avenue, New York 36, New
York

PSYCHOLOGICAL SERVICE CENTER PRESS, 1275 New Hampshire Avenue,
N.W., Washington 6, D.C.

* PUBLIC SCHOOL PUBLISHING COMPANY, 509-513 North East Street,
Bloomington, Illinois

* SCIENCE RESEARCH ASSOCIATES, INC., 57 West Grand Avenue, Chicago
10, Illinois

* SHERIDAN SUPPLY COMPANY, P.O. Box 837, Beverly Hills, California

* STANFORD UNIVERSITY PRESS, Stanford, California

* C. H. STOELTING COMPANY, 424 North Homan Avenue, Chicago 24,
Illinois

UNITED STATES EMPLOYMENT SERVICE, Department of Labor,
Washington 25, D.C.

* WORLD BOOK COMPANY, 313 Park Hill Avenue, Yonkers-on-Hudson 5,
New York

Prognostic Test of Mechanical Abilities, 266
Prognostic tests. See Aptitude batteries, and Aptitude tests
Progressive Education Association, 28
Projective test, 284-286
Promotion, 358
Publishers of tests, 63-64, 421
Publishers’ catalogs, 62, 421-422
Pupil Adjustment Inventory, 303
Pupils, classification of, 354-359


Questions, test: 
arranging, 139-147 
types of, 112-136 
Quotients: 
achievement, 55 
arithmetic, 55 
educational, 54-55
intelligence, 52-54, 245-251 
reading, 55 


Range, defined, 397 
Rank-difference correlation coefficient, 407-409
Ranking: 
simple, 35-36 
percentile, 36-37, 403-406 
Rasmussen General Mathematics 
Test, 233 
Rasmussen Trigonometry Test, 233 
Rating scales, personality measure- 
ment, 298-304 
Raw score, 52 
Read General Science Test, 227 
Reading: 
measurement practices in, 166-178 
three areas of measurement, 167-168
Reading quotient, 55
Reading readiness test, in measurement program, 323
Reading tests, 168-178


Recall type of question, 113 
Recognition type of question, 113 
Record, permanent, 341-342 
Record forms, 341, 342 
Recording test results, 342-343 
Relationship measures. See Correlation, measures of
Reliability, measuring instrument, 
methods for determining, 67-73, 
152-153, 412-413 
Reliability coefficient, 71 
interpretation of, 73, 414 
Remedial work, diagnosis, 365 
Reproducing the test (teacher-made), 144-147
Results of tests: 
analyzing, 148-153, 343 
interpretation, 83-84 
recording, 342-343 
Retardation, 358 
Revised Stanford-Binet, 239-242, 247, 248, 252, 253, 254
Rorschach Ink-Blot Test, 26, 29, 284 


Scatter diagram, 46, 47 
Scholastic aptitude. See Intelligence 
Scholastic aptitude tests, 236 
School boards, 17 
School staff, improvement of, 381 
Schools, interpreting to community, 
379-381 
Schrammel-Reed Solid Geometry 
Tests, 234 
Science, natural: 
achievement tests, 200-204 
measurement in, 195 
S.R.A. Achievement Series, 165 
S.R.A. Junior Inventory, 283 
S.R.A. Mechanical Aptitudes, 267 
S.R.A. Primary Mental Abilities, 263 
S.R.A. Youth Inventory, 284 
Score points, 395 
Scores, test: 
concepts and tools for interpretation, 34-61, 387-420
derived, 52 
percentiles, 36-37 
raw, 52 
sigma, 41, 44, 45 
simple ranks, 35-36 
standard, 43-44
transmuted, 52 
T-scores, 44 
Z-scores, 44
Scoreze, 198 
Scoring, tests, 337-341
completion questions, 121-122 
correcting for chance, 123-124 
ease of, 81-83 
essay type, 114-118 
multiple-choice questions, 131 
objective type, 112, 115 
short-answer questions, 119, 121-122
subjective type, 112 
suggestions for efficient, 338-339 
true-false questions, 123 
Scoring key, preparation of, 147 
Seashore Measures of Musical Talents, 
76, 268-269 
Seattle Algebra Test, 234 
Seattle Plane Geometry Test, 234 
Secondary school: 
achievement tests, 214-234
problems of producing standardized tests, 204-205
survey batteries, 206-214 
Selective Art Aptitude Test, 270 
Self-correlation, for determining reliability, 71
Self-report technique:
attitudes, measurement of, 291-294
interest inventories, 286-291 
personality inventories, 278-284 
projective techniques, 284-286 
Semi-interquartile range, 40, 397-398
calculation from frequency distribution, 398
determination of, 394-395
as a measure of variability, 40-41 
Shaycoft Plane Geometry Test, 
234 
Short-answer questions, 118-120
Short Employment Tests, 268 
Sigma scores, 41, 44, 45 
Small school, measurement program, 
344 
Snader General Mathematics Test, 
234 
Social studies: 
achievement tests in, 195-200, 204, 
219-224 
measurement practices and techniques, 196-200
trends, 195 
Sociogram, 311-314 
uses of, 314-316
Spearman-Brown Prophecy Formula, 
70, 72, 412-413 
Spelling: 
measurement practices and techniques in, 166, 189-191
tests, 190-191 
Split-halves correlation, 72-73 
Standard deviation: 
defined, 41, 42, 400 
calculation of, 399-402
Standard scores, 43-44
Standardized tests, 4-5
adequate norms, 85
administering, 331-337
defined, 4n 
equivalent forms, 86 
ethics of using, 328-329 
and nonstandardized, 85, 86 
scheduling, 329-331
scoring, suggestions for efficient, 
338-341 
secondary schools, problems of producing for, 204-205
selecting and obtaining, 324-329 
versus teacher-made tests, 108-110 
Stanford Achievement Test, 25, 86, 
157-163, 177, 187, 200, 202, 355 
Stanford-Binet (1916), 252. See also 
Revised Stanford-Binet 
Stanford Revision of the Binet Scale, 239
Statistical techniques, for analysis of 
tests, 343 
Stenquist Mechanical Aptitude Tests, 
26n, 265-266 
Strong Vocational Interest Blanks, 
287-288 
Stroud, Hieronymus and McKee Primary Reading Profiles, 177
Study of Values, 284 
Survey batteries: 
elementary grades, 156-165 
nature and purposes, 207 
secondary schools, 207-214 
usefulness and limitations, 207 
Survey tests, historical significance, 25 
Symmetrical distribution, 397n 


Teacher-made tests: 
assembling the test, 139-148 
analyzing results, 148-153 
construction of, 111-137 
planning, 109 
versus standardized tests, 108-110 
Teaching, five essential processes in, 90
Terman Group Test, 25 
Terman-McNemar Test of Mental 
Ability, 264 
Test construction: 
analyzing results, 148-153
assembling the test, 139-147
basic principles, 111-113
objective test questions, general suggestions, 134-136
objectives, function of, 91-92, 96-106
questions, types of, 112-134
secondary grades, problems, 204-206
Test description, check list for, 155-156
Test maker, basic qualifications of, 
110 
Test manuals, 63 
Test of Application of Principles in 
Physical Science, 227 
Test of English Usage, 219 
Test of General Proficiency in Field 
of Natural Science, 227 
Test of Mechanical Comprehension, 
267 
Test publishers, 63-64, 421-422
Test questions, types of, 112-136 
Test results: 
analyzing, 148-153, 343 
interpreting, 83-84 
recording, 342-343 
Test scores: 
concepts and tools for interpretation, 35-45, 387-420
derived, 52 
percentile rank, 36-37 
raw, 52 
sigma, 41, 44, 45 
simple ranking, 35-36 
standard, 43-44
transmuted, 52 
T-scores, 44 
Z-scores, 44 
Testing: 
development of, 23-28 
economy, 86-87
frequency and grade levels, 321-322
meaning, 12
physical conditions, 334-335
place of, 330
scheduling, 329-331
time of year, 321
Testing program, 27
minimum school or community-wide, 323-324
See also Measurement program
Tests:
achievement. See Achievement tests
administering, 80-81, 326, 331-337
aptitude. See Aptitude tests
army, 24-25
construction. See Test construction 
costs, 86-87, 156n
development after World War I, 24-25
diagnostic, 328-329, 362-365
information on, 61-66
intelligence. See Intelligence tests
motivation, used for, 374-376
objectives as basis of, 90-106
objectivity in, 79-80
planning, 109-112
requirements of good, 66-87
scheduling time and place, 329-331
scoring. See Scoring, tests 
selecting and obtaining, 324-325 
standardized. See Standardized 
tests 
tools of measurement, 318 
Tests of General Education Development, 214
Thorndike Scale for Hand-Writing of 
Children, 179 
Three R’s, measurement of, 165-195 
Thurstone Interest Schedule, 291 
Tool-subjects, 166 
Traxler High School Reading Test, 
177 
Traxler Silent Reading Test, 177 
Trigonometry test, 232, 233 
True-false questions, construction and 
scoring, 122-125 
T-scores, 44 
Turse Shorthand Aptitude Test, 268 
Two-factor theory of intelligence, 244 


Understanding American History, 224
Understanding of Basic Social Concepts, 207
University of Kansas Bulletin of Education, 356n, 357n
University research and service bureaus, historical importance of, 64-65
Use of Sources of Information, 207


Validity: 
coefficient of correlation used to 
determine, 77, 411, 412 
curricular, 74, 75 
empirical, 77, 78 
item-discrimination as a measure of 
establishing, 149-150 
logical, 76, 77 
meaning of, 73, 74 
measuring instrument, 73-87 
methods of determining, 74-78 
and reliability, 74 
teacher-made tests, 148-149 
Variability: 
common measures of, 40-43, 397-403
range, 40, 397
semi-interquartile range, 40-41, 397-398
standard deviation, 41-44, 399-403
Vertical classification, 355 
Vineland Social Maturity Scale, 304 
Vocational guidance, 29-30 


Wechsler Adult Intelligence Scale, 
254, 262 
Wechsler-Bellevue Intelligence Scale, 
253, 262 
Wechsler Intelligence Scale for Children, 254n, 262
Whole child concept, 29
Wisconsin Cooperative Education Planning Program, 94
Wisconsin Inventory Tests in Arithmetic, 188
Word-association test, 284-285
Writing, 165, 166
measurement practices in, 179


Z-score, 44